Global Optimization for Value Function Approximation 




Marek Petrik petrik@cs.umass.edu 

Department of Computer Science 
University of Massachusetts 
Amherst, MA 01003, USA 

Shlomo Zilberstein shlomo@cs.umass.edu
Department of Computer Science
University of Massachusetts
Amherst, MA 01003, USA

Editor: 

Abstract 

Existing value function approximation methods have been successfully used in many
applications, but they often lack useful a priori error bounds. We propose a new approximate
bilinear programming formulation of value function approximation, which employs global
optimization. The formulation provides strong a priori guarantees on both robust and
expected policy loss by minimizing specific norms of the Bellman residual. Solving a
bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual
minimization itself is NP-hard. We describe and analyze both optimal and approximate
algorithms for solving bilinear programs. The analysis shows that this algorithm offers
a convergent generalization of approximate policy iteration. We also briefly analyze the
behavior of bilinear programming algorithms under incomplete samples. Finally, we
demonstrate that the proposed approach can consistently minimize the Bellman residual on
simple benchmark problems.

Keywords: value function approximation, Markov decision processes, reinforcement 
learning, approximate dynamic programming 

1. Motivation 

Solving large Markov Decision Problems (MDPs) is a very useful, but computationally 
challenging problem addressed widely in the AI literature, particularly in the area of rein- 
forcement learning. It is widely accepted that large MDPs can only be solved approximately. 
The commonly used approximation methods can be divided into three broad categories: 1) 
policy search, which explores a restricted space of all policies, 2) approximate dynamic pro- 
gramming, which searches a restricted space of value functions, and 3) approximate linear 
programming, which approximates the solution using a linear program. While all of these 
methods have achieved impressive results in many application domains, they have significant 
limitations. 

Policy search methods rely on local search in a restricted policy space. The policy may 
be represented, for example, as a finite-state controller (Stanley and Miikkulainen, 2004) or 
as a greedy policy with respect to an approximate value function (Szita and Lorincz, 2006). 
Policy search methods have achieved impressive results in such domains as Tetris (Szita and 
Lorincz, 2006) and helicopter control (Abbeel et al., 2006). However, they are notoriously
hard to analyze. We are not aware of any established theoretical guarantees regarding the 
quality of the solution. 

Approximate dynamic programming (ADP) methods iteratively approximate the value
function (Bertsekas and Ioffe, 1997; Powell, 2007; Sutton and Barto, 1998). They have been
extensively analyzed and are the most commonly used methods. However, approximate
dynamic programming methods typically do not converge and they only provide weak
guarantees of approximation quality. The approximation error bounds are usually expressed in
terms of the worst-case approximation of the value function over all policies (Bertsekas and
Ioffe, 1997). In addition, most available bounds are with respect to the $L_\infty$ norm, while the
algorithms often minimize the $L_2$ norm. While there exist some $L_2$-based bounds (Munos,
2003), they require values that are difficult to obtain.

Approximate linear programming (ALP) uses a linear program to compute the approximate
value function in a particular vector space (de Farias, 2002). ALP has been previously
used in a wide variety of settings (Adelman, 2004; de Farias and van Roy, 2004; Guestrin
et al., 2003). Although ALP often does not perform as well as ADP, there have been some
recent efforts to close the gap (Petrik and Zilberstein, 2009). ALP has better theoretical
properties than ADP and policy search. It is guaranteed to converge and return the closest
$L_1$-norm approximation $\tilde{v}$ of the optimal value function $v^*$ up to a multiplicative factor.
However, the $L_1$ norm must be properly weighted to guarantee a small policy loss, and
there is no reliable method for selecting appropriate weights (de Farias, 2002).

To summarize, the existing reinforcement learning techniques often provide good so- 
lutions, but typically require significant domain knowledge (Powell, 2007). The domain 
knowledge is needed partly because useful a priori error bounds are not available, as men- 
tioned above. Our goal is to develop a more reliable method that is guaranteed to minimize 
bounds on the policy loss in various settings. 

We present new formulations of value function approximation that provably minimize
bounds on the policy loss using global optimization. Most of these bounds do not rely
on values that are hard to obtain, unlike, for example, approximate linear programming.
The focus of the work is on two broad bound minimization approaches: 1) minimizing $L_\infty$
bounds, and 2) minimizing weighted $L_1$ norm bounds on the policy loss. In some sense, the
formulations minimize the bounds by unifying policy and value-function search methods.

We start with a description of the framework and notation in Section 2 and the de- 
scription of value function approximation in Section 3. Then, in Section 4, we describe the 
proposed approximate bilinear programming (ABP) formulations. Bilinear programs are 
typically solved using global optimization methods, which we briefly discuss in Section 5. 
A drawback of the bilinear formulation is that solving bilinear programs may require expo- 
nential time. We also show in Section 5 that this is unavoidable, because minimizing the 
approximation error bound is in fact NP-hard. 

In practice, only sampled versions of ABPs are often solved. While a thorough treat- 
ment of sampling is beyond the scope of this paper, we examine the impact of sampling and 
establish some guarantees in Section 6. Unlike classical sampling bounds on approximate 
linear programming, we describe bounds that apply to the worst-case error. Section 7 shows 
that ABP is related to other approximate dynamic programming methods, such as approx- 
imate linear programming and policy iteration. Section 8 demonstrates the applicability 
of ABP using common reinforcement learning benchmark problems. Technical proofs are
provided in the appendix. 

The general setting considered in this paper is a restriction of reinforcement learning. 
Reinforcement learning methods can use samples without requiring a model of the environ- 
ment. The methods we propose can also be based on samples, but they require additional 
structure. In particular, they require that all or most actions are sampled for every state. 
Such samples can be easily generated when a model of the environment is available. 

2. Framework and Notation 

This section formally defines the framework and the notation we use. We also define Markov 
decision processes and the approximation errors involved. Markov decision processes come 
in many flavors based on the objective function that is optimized. This work focuses on the 
infinite horizon discounted MDPs, which are defined as follows. 

Definition 1 (e.g., Puterman, 2005). A Markov Decision Process is a tuple
$(\mathcal{S}, \mathcal{A}, P, r, \alpha)$. Here, $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of
actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$ is the transition function ($P(s, a, s')$ is the
probability of transiting to state $s'$ from state $s$ given action $a$), and
$r : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ is a reward function. The initial distribution is
$\alpha : \mathcal{S} \mapsto [0, 1]$, such that $\sum_{s \in \mathcal{S}} \alpha(s) = 1$.

The goal is to find a sequence of actions that maximizes the $\gamma$-discounted cumulative
sum of the rewards, also called the return, where $\gamma \in [0, 1)$ is the discount factor.
A solution of a Markov decision process is a policy, which is defined as follows.

Definition 2. A deterministic stationary policy $\pi : \mathcal{S} \mapsto \mathcal{A}$ assigns an action to each state
of the Markov decision process. A stochastic policy is a function $\pi : \mathcal{S} \times \mathcal{A} \mapsto [0, 1]$.
The set of all stochastic stationary policies is denoted as $\Pi$, and each such policy satisfies
$\sum_{a \in \mathcal{A}} \pi(s, a) = 1$ for every state $s$.

General non-stationary policies may take different actions in states in different time- 
steps. We limit our treatment to stationary policies, since for infinite-horizon MDPs there 
exists an optimal stationary and deterministic policy. We also consider stochastic policies 
because they are more convenient to use in some settings that we consider. 

The transition and reward functions for a given policy are denoted by $P_\pi$ and $r_\pi$. The
value function update for a policy $\pi$ is denoted by $L_\pi$, and the Bellman operator is denoted
by $L$. That is:

$$L_\pi v = \gamma P_\pi v + r_\pi, \qquad Lv = \max_{\pi \in \Pi} L_\pi v.$$

The optimal value function, denoted $v^*$, satisfies $v^* = Lv^*$.
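The two operators can be sketched numerically as follows. This is a minimal illustration, assuming the operators take the standard forms $L_\pi v = \gamma P_\pi v + r_\pi$ and $Lv = \max_\pi L_\pi v$; the two-state MDP, the array layout `P[s, a, s']`, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

def bellman_policy_update(P, r, gamma, pi, v):
    """L_pi v = gamma * P_pi v + r_pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # |S| x |S| transition matrix under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward under pi
    return gamma * P_pi @ v + r_pi

def bellman_update(P, r, gamma, v):
    """L v = max_pi L_pi v, computed as a maximum over actions in each state."""
    q = r + gamma * P @ v                   # q[s, a]
    return q.max(axis=1)

def value_iteration(P, r, gamma, tol=1e-10):
    """Iterate v <- L v until the update is smaller than tol."""
    v = np.zeros(P.shape[0])
    while True:
        v_next = bellman_update(P, r, gamma, v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

# Hypothetical 2-state, 2-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
v_star = value_iteration(P, r, gamma)       # approximately satisfies v* = L v*
```

The maximum over policies reduces to a per-state maximum over actions because a deterministic greedy policy attains it.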

We assume a vector representation of the policy $\pi \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. The variables $\pi$ are defined
for all state-action pairs and represent policies. That is, $\pi(s, a)$ represents the probability
of taking action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$. The space of all correct (stochastic) policies can be
represented using a set of linear equations:

$$\sum_{a \in \mathcal{A}} \pi(s, a) = 1 \quad \forall s \in \mathcal{S}$$
$$\pi(s, a) \ge 0 \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$$

These inequalities can be represented using matrix notation as follows.

$$B\pi = \mathbf{1}$$
$$\pi \ge \mathbf{0},$$

where the matrix $B : |\mathcal{S}| \times (|\mathcal{S}| \cdot |\mathcal{A}|)$ is defined as follows.

$$B(s', (s, a)) = \begin{cases} 1 & s = s' \\ 0 & \text{otherwise} \end{cases}$$

We use $\mathbf{0}$ and $\mathbf{1}$ to denote vectors of all zeros or ones of the appropriate size respectively.
The symbol $I$ denotes an identity matrix of the appropriate dimension.
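As an illustration, the constraint matrix $B$ can be built explicitly. The sketch below assumes that state-action pairs are ordered state-major (all actions of state 0 first, then state 1, and so on); this ordering and the helper name are assumptions of the example, not something fixed by the text.

```python
import numpy as np

def policy_constraint_matrix(n_states, n_actions):
    """B[s', (s, a)] = 1 if s == s' and 0 otherwise; pairs ordered state-major."""
    B = np.zeros((n_states, n_states * n_actions))
    for s in range(n_states):
        B[s, s * n_actions:(s + 1) * n_actions] = 1.0
    return B

n_states, n_actions = 3, 2
B = policy_constraint_matrix(n_states, n_actions)

pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform stochastic policy
pi_vec = pi.reshape(-1)                                 # flatten in state-major order

# A vector is a correct stochastic policy when B pi = 1 and pi >= 0.
is_valid = bool(np.allclose(B @ pi_vec, 1.0) and np.all(pi_vec >= 0))
```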

In addition, a policy $\pi$ induces a state visitation frequency $u_\pi : \mathcal{S} \to \mathbb{R}$, defined as
follows:

$$u_\pi = (I - \gamma P_\pi^T)^{-1} \alpha.$$

The return of a policy depends on the state-action visitation frequencies and $\alpha^T v_\pi = r_\pi^T u_\pi$.
The optimal state-action visitation frequency is $u_{\pi^*}$. State-action visitation frequency
$u : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined for all states and actions. Notice the missing subscript. We use $u_a$ to
denote the part of $u$ that corresponds to action $a \in \mathcal{A}$. State-action visitation frequencies
must satisfy:

$$\sum_{a \in \mathcal{A}} (I - \gamma P_a)^T u_a = \alpha.$$

To formulate approximate linear and bilinear programs, it is necessary to restrict the 
value functions so that their Bellman residuals are non-negative (or at least bounded from 
below). We call such value functions transitive-feasible and define them as follows. 

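Definition 3. A value function is transitive-feasible when $v \ge Lv$. The set of
transitive-feasible value functions is:

$$\mathcal{K} = \{ v \in \mathbb{R}^{|\mathcal{S}|} : v \ge Lv \}.$$

Given some $\epsilon \ge 0$, the set of $\epsilon$-transitive-feasible value functions is:

$$\mathcal{K}(\epsilon) = \{ v \in \mathbb{R}^{|\mathcal{S}|} : v \ge Lv - \epsilon \mathbf{1} \}.$$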



Notice that the optimal value function v* is transitive-feasible. The following lemma 
summarizes the key property of transitive-feasible value functions: 

Lemma 4. Transitive feasible value functions form an upper bound on the optimal value
function. If $v \in \mathcal{K}(\epsilon)$ is an $\epsilon$-transitive-feasible value function, then

$$v \ge v^* - \frac{\epsilon}{1 - \gamma} \mathbf{1}.$$

3. Value Function Approximation

This section describes the basic methods for value function approximation. MDPs used in
practical applications are often too large for the optimal policy to be computed precisely.
In these cases, we first calculate an approximate value function $\tilde{v}$ and then take the greedy
policy $\pi$ with respect to it. The quality of such a policy can be characterized using its value
function in one of the following two main ways.

Definition 5 (Policy Loss). Let $\pi$ be a policy computed from value function approximation.
The expected policy loss measures the expected loss of $\pi$, defined as follows:

$$\|v^* - v_\pi\|_{1,\alpha} = \alpha^T v^* - \alpha^T v_\pi \qquad (1)$$

where $\|x\|_{1,c}$ denotes the weighted $L_1$ norm: $\|x\|_{1,c} = \sum_i |c(i) x(i)|$.

The robust policy loss measures the worst-case loss of $\pi$, defined as follows:

$$\|v^* - v_\pi\|_\infty = \max_{s \in \mathcal{S}} |v^*(s) - v_\pi(s)| \qquad (2)$$

The expected policy loss captures the total loss of discounted reward when following
the policy $\pi$ instead of the optimal policy, assuming the initial distribution. The robust
policy loss ignores the initial distribution and, in some sense, measures the difference for
the worst-case initial distribution.
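Both losses are straightforward to compute on a small MDP where $v^*$ and $v_\pi$ can be obtained exactly. The following sketch uses a randomly generated MDP, a uniform initial distribution, and an arbitrary fixed policy; all of these, and the helper names, are hypothetical and serve only to illustrate definitions (1) and (2).

```python
import numpy as np

def policy_value(P, r, gamma, policy):
    """v_pi = (I - gamma P_pi)^{-1} r_pi for a deterministic policy (array of actions)."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi, r_pi = P[idx, policy], r[idx, policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def optimal_value(P, r, gamma, iters=2000):
    """Plain value iteration; adequate for a tiny illustrative MDP."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        v = (r + gamma * P @ v).max(axis=1)
    return v

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                # normalize rows into distributions
r = rng.random((n_states, n_actions))
alpha = np.full(n_states, 1.0 / n_states)        # uniform initial distribution

v_star = optimal_value(P, r, gamma)
policy = np.zeros(n_states, dtype=int)           # arbitrary, possibly suboptimal policy
v_pi = policy_value(P, r, gamma, policy)

expected_loss = alpha @ v_star - alpha @ v_pi    # expected policy loss, equation (1)
robust_loss = np.max(np.abs(v_star - v_pi))      # robust policy loss, equation (2)
```

Because $\alpha$ is a probability distribution and $v^* \ge v_\pi$, the expected loss never exceeds the robust loss.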

A set of state features is a necessary component of value function approximation. These
features must be supplied in advance and must capture the essential structure of the
problem. The features are defined by mapping each state $s$ to a vector $\phi(s)$ of features. We
denote $\phi_i : \mathcal{S} \to \mathbb{R}$ to be a function that maps states to the value of feature $i$:

$$\phi_i(s) = (\phi(s))_i.$$

The desirable properties of the features depend strongly on the algorithm, samples, and
attributes of the problem; the tradeoffs are not yet fully understood. The function $\phi_i$ can
also be treated as a vector, similarly to the value function $v$.

Value function approximation methods compute value functions that can be represented 
using the state features. We call such value functions representable and define them below. 

Definition 6. Given a convex polyhedral set $\mathcal{M} \subseteq \mathbb{R}^{|\mathcal{S}|}$, a value function $v$ is representable
(in $\mathcal{M}$) if $v \in \mathcal{M}$.

Many methods that compute a value function based on a given set of features have been
developed, such as neural networks and genetic algorithms (Bertsekas and Ioffe, 1997). Most
of these methods are extremely hard to analyze, computationally complex, and hard to use.
Moreover, these complex methods do not satisfy the convexity assumption in Definition 6.
A simpler, more common, method is linear value function approximation. In linear value
function approximation, the value function of state $s$ is represented as a linear combination
of nonlinear features $\phi(s)$. Linear value function approximation is easy to apply and analyze.

Linear value function approximation can be expressed in terms of matrices as follows.
Let the matrix $\Phi : |\mathcal{S}| \times m$ represent the features for the state-space, where $m$ is the number
of features. The rows of the feature matrix $\Phi$, also known as the basis, correspond to the
features of the states $\phi(s)$. The feature matrix can be defined in one of the following two
ways:

$$\Phi = \begin{pmatrix} \phi(s_1)^T \\ \phi(s_2)^T \\ \vdots \end{pmatrix}, \qquad \Phi = \begin{pmatrix} \phi_1 & \phi_2 & \cdots \end{pmatrix}.$$

The value function $v$ is then represented as $v = \Phi x$ and the set of representable functions
is $\mathcal{M} = \operatorname{colspan}(\Phi)$.
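A small numerical sketch of the representation $v = \Phi x$ follows. The polynomial features, the weights, and the target vector are hypothetical values chosen for illustration; the least-squares projection is shown only as one simple way to find a representable function near a given target, not as the method proposed in this paper.

```python
import numpy as np

n_states, m = 5, 3
states = np.arange(n_states, dtype=float)

# Hypothetical polynomial features phi(s) = (1, s, s^2); row s of Phi is phi(s).
Phi = np.stack([np.ones(n_states), states, states ** 2], axis=1)

x = np.array([1.0, -0.5, 0.25])      # feature weights
v = Phi @ x                          # a representable value function, v in colspan(Phi)

# Project an arbitrary value function onto colspan(Phi) in the L2 sense.
v_target = np.array([0.0, 1.0, 0.5, 3.0, 2.0])
x_fit, *_ = np.linalg.lstsq(Phi, v_target, rcond=None)
v_fit = Phi @ x_fit
```

The number of free variables drops from the number of states to the number of features $m$, which is what makes the later formulations tractable in size.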

The goal of value function approximation is not just to obtain a good value function
$\tilde{v}$ but a policy with a small policy loss. Unfortunately, the policy loss of a greedy policy,
as formulated in Definition 5, depends non-trivially on the approximate value function $\tilde{v}$.
Often, the only reliable method of precisely computing the policy loss is to simulate the
policy, which can be very costly. Value function approximation methods, therefore, optimize
bounds on the policy loss.

Theorem 7. [Robust Policy Loss, e.g. (Williams and Baird, 1994)] Let $\pi$ be a greedy policy
with respect to a value function $v$. Then:

$$\|v^* - v_\pi\|_\infty \le \frac{2}{1 - \gamma} \|v - Lv\|_\infty.$$

In addition, if $v \in \mathcal{K}$, the policy loss is minimized for the greedy policy and:

$$\|v^* - v_\pi\|_\infty \le \frac{1}{1 - \gamma} \|v - Lv\|_\infty.$$
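The robust bound can be checked numerically on a small example. The sketch below assumes the first bound of Theorem 7 reads $\|v^* - v_\pi\|_\infty \le \frac{2}{1-\gamma}\|v - Lv\|_\infty$; the random MDP and the perturbed value function are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def bellman(v):
    """Bellman operator L applied to v."""
    return (r + gamma * P @ v).max(axis=1)

v_star = np.zeros(n_states)
for _ in range(2000):                                 # value iteration for v*
    v_star = bellman(v_star)

v = v_star + rng.normal(scale=0.5, size=n_states)     # perturbed approximation of v*
greedy = (r + gamma * P @ v).argmax(axis=1)           # greedy policy with respect to v
idx = np.arange(n_states)
v_greedy = np.linalg.solve(np.eye(n_states) - gamma * P[idx, greedy], r[idx, greedy])

policy_loss = np.max(np.abs(v_star - v_greedy))       # robust policy loss
bound = 2.0 / (1.0 - gamma) * np.max(np.abs(v - bellman(v)))
```

On examples like this the bound is typically loose by a wide margin, which is consistent with the remark that robust bounds can be overly conservative.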

The bounds above ignore the initial distribution and may often be overly conserva- 
tive. We establish new bounds on the expected policy loss that also consider the initial 
distribution. 

Theorem 8. [Expected Policy Loss] Let $\pi$ be a greedy policy with respect to a value function
$v$ and let the state-action visitation frequencies of $\pi$ be bounded as $\underline{u} \le u_\pi \le \bar{u}$. Then:

$$\|v^* - v_\pi\|_{1,\alpha} = \alpha^T v^* - \alpha^T v + u_\pi^T (v - Lv)$$
$$\le \alpha^T v^* - \alpha^T v + \underline{u}^T [v - Lv]_- + \bar{u}^T [v - Lv]_+ .$$

The state-visitation frequency $u_\pi$ depends on the initial distribution $\alpha$, unlike $v^*$. In
addition, when $v \in \mathcal{K}$, the bound is:

$$\|v^* - v_\pi\|_{1,\alpha} \le -\|v^* - v\|_{1,\alpha} + \|v - Lv\|_{1,\bar{u}}$$
$$\|v^* - v_\pi\|_{1,\alpha} \le -\|v^* - v\|_{1,\alpha} + \frac{1}{1 - \gamma} \|v - Lv\|_\infty$$

Here we use $[x]_+ = \max\{x, 0\}$ and $[x]_- = \min\{x, 0\}$ componentwise.
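Proof. The bound is derived as follows:

$$\begin{aligned}
\alpha^T v^* - \alpha^T v_\pi
&= \alpha^T v^* - \alpha^T v_\pi + \left( u_\pi^T (I - \gamma P_\pi) - \alpha^T \right) v \\
&= \alpha^T v^* - r_\pi^T u_\pi + \left( u_\pi^T (I - \gamma P_\pi) - \alpha^T \right) v \\
&= \alpha^T v^* - r_\pi^T u_\pi + u_\pi^T (I - \gamma P_\pi) v - \alpha^T v \\
&= \alpha^T v^* - \alpha^T v + u_\pi^T \left( (I - \gamma P_\pi) v - r_\pi \right) \\
&= \alpha^T v^* - \alpha^T v + u_\pi^T (v - Lv).
\end{aligned}$$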

We used the fact that $u_\pi^T (I - \gamma P_\pi) - \alpha^T = \mathbf{0}^T$, which follows from the definition of state-action
visitation frequencies, and that $v^* \ge v_\pi$. The inequalities for $v \in \mathcal{K}$ follow from Lemma 38,
Lemma 43, and the trivial version of Hölder's inequality:

$$\alpha^T v^* - \alpha^T v = -\|v^* - v\|_{1,\alpha}$$
$$u_\pi^T (v - Lv) \le \|u_\pi\|_1 \|v - Lv\|_\infty = \frac{1}{1 - \gamma} \|v - Lv\|_\infty$$

□ 
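The equality at the heart of this proof can be verified numerically. The sketch below assumes the state visitation frequencies are given by $u_\pi = (I - \gamma P_\pi^T)^{-1}\alpha$ and checks the equivalent statement $\alpha^T v_\pi = \alpha^T v - u_\pi^T (v - Lv)$ for a greedy policy $\pi$; the random MDP and the arbitrary vector $v$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
alpha = rng.random(n_states)
alpha /= alpha.sum()                              # initial distribution

v = rng.normal(size=n_states)                     # arbitrary approximate value function
q = r + gamma * P @ v
greedy = q.argmax(axis=1)                         # greedy policy with respect to v
Lv = q.max(axis=1)                                # equals L_pi v for the greedy policy

idx = np.arange(n_states)
P_pi, r_pi = P[idx, greedy], r[idx, greedy]
M = np.eye(n_states) - gamma * P_pi
v_pi = np.linalg.solve(M, r_pi)                   # value of the greedy policy
u_pi = np.linalg.solve(M.T, alpha)                # state visitation frequencies

lhs = alpha @ v_pi
rhs = alpha @ v - u_pi @ (v - Lv)                 # identity derived in the proof
```

The same computation shows that the entries of $u_\pi$ sum to $1/(1-\gamma)$, which is the constant that appears when Hölder's inequality is applied.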

Notice that the bounds in Theorem 8 can be minimized even without knowing the
optimal $v^*$. The optimal value function $v^*$ is independent of the approximate value function
$v$ and the greedy policy $\pi$ depends only on $v$.

Remark 9. The bounds in Theorem 8 generalize the bounds established by de Farias (2002,
Theorem 1.3), which state that whenever $v \in \mathcal{K}$:

$$\|v^* - v_\pi\|_{1,\alpha} \le \frac{1}{1 - \gamma} \|v^* - v\|_{1,(1-\gamma)u}.$$

This bound is a special case of Theorem 8 because $\alpha^T v^* - \alpha^T v \le 0$ and:

$$\|v - Lv\|_{1,u} \le \|v^* - v\|_{1,u} \le \frac{1}{1 - \gamma} \|v^* - v\|_{1,(1-\gamma)u}$$

from $v^* \le Lv \le v$. The proof of Theorem 8 also simplifies the proof of Theorem 1.3 in
(de Farias, 2002).

The methods that we propose require the following standard assumption. 

Assumption 10. All multiples of the constant vector $\mathbf{1}$ are representable in $\mathcal{M}$. That is,
for all $k \in \mathbb{R}$ we have that $k\mathbf{1} \in \mathcal{M}$.

Notice that the representation set $\mathcal{M}$ satisfies Assumption 10 when the first column of $\Phi$
is $\mathbf{1}$. The impact of including the constant feature is typically negligible because adding a
constant to the value function does not change the greedy policy.

Value function approximation algorithms are typically variations of the exact algorithms 
for solving MDPs. Hence, they can be categorized as approximate value iteration, approx- 
imate policy iteration, and approximate linear programming. The ideas of approximate 
value iteration could be traced to Bellman (1957), which was followed by many additional 
research efforts (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Powell, 2007). Be- 
low, we only discuss approximate policy iteration and approximate linear programming, 
because they are the most closely related to our approach. 

Approximate policy iteration (API) is summarized in Algorithm 1. The function $Z(\pi)$
denotes the specific method used to approximate the value function for the policy $\pi$. The
two most commonly used methods, Bellman residual approximation and least-squares
approximation (Lagoudakis and Parr, 2003), minimize the $L_2$ norm of the Bellman residual.

The approximations based on minimizing the $L_2$ norm of the Bellman residual are common
in practice since they are easy to compute and often lead to good results. Most theoretical
analyses of API, however, assume minimization of the $L_\infty$ norm of the Bellman residual:

$$Z(\pi) \in \arg\min_{v \in \mathcal{M}} \|(I - \gamma P_\pi) v - r_\pi\|_\infty \qquad (3)$$

Algorithm 1: Approximate policy iteration, where $Z(\pi)$ denotes a custom value
function approximation for the policy $\pi$.

    $\pi_0, k \leftarrow \text{rand}, 1$ ;
    while $\pi_k \neq \pi_{k-1}$ do
        $v_k \leftarrow Z(\pi_{k-1})$ ;
        $\pi_k(s) \leftarrow \arg\max_{a \in \mathcal{A}} r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s, a, s') v_k(s') \quad \forall s \in \mathcal{S}$ ;
        $k \leftarrow k + 1$ ;
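The loop of Algorithm 1 can be sketched in a few lines. In this illustration $Z(\pi)$ is taken to be least-squares ($L_2$) Bellman residual minimization over $v = \Phi x$, which is an assumption made for the example; the iteration cap is also an addition, included because API is not guaranteed to converge. The function name and array layout are hypothetical.

```python
import numpy as np

def approximate_policy_iteration(P, r, gamma, Phi, max_iters=50):
    """API sketch; Z(pi) here is L2 Bellman residual minimization (an assumption)."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    v = np.zeros(n_states)
    for _ in range(max_iters):                       # cap iterations: API may cycle
        P_pi, r_pi = P[idx, policy], r[idx, policy]
        # Z(pi): minimize ||(I - gamma P_pi) Phi x - r_pi||_2 over the weights x.
        A = (np.eye(n_states) - gamma * P_pi) @ Phi
        x, *_ = np.linalg.lstsq(A, r_pi, rcond=None)
        v = Phi @ x
        new_policy = (r + gamma * P @ v).argmax(axis=1)   # greedy improvement step
        if np.array_equal(new_policy, policy):
            break                                    # policy is stable
        policy = new_policy
    return policy, v
```

With $\Phi = I$ every value function is representable and the sketch reduces to exact policy iteration, which gives a convenient sanity check.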



$L_\infty$-API is shown in Algorithm 1, where $Z(\pi)$ is calculated using the following program:

$$\begin{array}{ll}
\min_{\phi, v} & \phi \\
\text{s.t.} & (I - \gamma P_\pi) v + \mathbf{1} \phi \ge r_\pi \\
 & -(I - \gamma P_\pi) v + \mathbf{1} \phi \ge -r_\pi \\
 & v \in \mathcal{M}
\end{array}$$

We are not aware of convergence or divergence proofs of $L_\infty$-API, and this analysis is
beyond the scope of this paper. Theoretically, it is also possible to minimize the $L_1$ norm
of the Bellman residual, but we are not aware of any study of such an approximation.

In the above description of API, we assumed that the value function is approximated for 
all states and actions. This is impossible in practice due to the size of the MDP. Instead, API 
only relies on a subset of states and actions, provided as samples. API is not guaranteed to 
converge in general and its analysis is typically in terms of limit behavior. The limit bounds 
are often very loose. We discuss the performance of API and how it relates to approximate 
bilinear programming in more detail in Section 7. 

Approximate linear programming, a method for value function approximation, is
based on the linear program formulation of exact MDPs:

$$\begin{array}{ll}
\min_{v} & \sum_{s \in \mathcal{S}} c(s) v(s) \\
\text{s.t.} & v(s) - \gamma \sum_{s' \in \mathcal{S}} P(s, a, s') v(s') \ge r(s, a) \quad \forall (s, a) \in (\mathcal{S}, \mathcal{A})
\end{array} \qquad (5)$$

We use $A$ as a shorthand notation for the constraint matrix and $b$ for the right-hand
side. The value $c$ represents a distribution over the states, usually a uniform one. That
is, $\sum_{s \in \mathcal{S}} c(s) = 1$. The linear program (5) is often too large to be solved precisely, so it is
approximated to get an approximate linear program by assuming that $v \in \mathcal{M}$ (de Farias
and van Roy, 2003), as follows:

$$\begin{array}{ll}
\min_{v} & c^T v \\
\text{s.t.} & Av \ge b \\
 & v \in \mathcal{M}
\end{array} \qquad (\text{ALP-}L_1)$$

The constraint $v \in \mathcal{M}$ denotes the approximation. To actually solve this linear program,
the value function is represented as $v = \Phi x$. Assumption 10 guarantees the feasibility of
the ALP. The optimal solution of the ALP, $\tilde{v}$, satisfies $\tilde{v} \ge v^*$. Therefore, the objective of
(ALP-$L_1$) represents the minimization of $\|\tilde{v} - v^*\|_{1,c}$ (de Farias, 2002).
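The constraint matrix $A$ and right-hand side $b$ can be assembled directly from the transition and reward arrays. The sketch below assumes one row per state-action pair in state-major order, with row $(s, a)$ encoding $v(s) - \gamma \sum_{s'} P(s, a, s') v(s') \ge r(s, a)$; the helper name and the random MDP are hypothetical.

```python
import numpy as np

def alp_constraints(P, r, gamma):
    """Build A and b so that A v >= b encodes v >= L_a v for every action a."""
    n_states, n_actions, _ = P.shape
    E = np.repeat(np.eye(n_states), n_actions, axis=0)    # selects v(s) for row (s, a)
    A = E - gamma * P.reshape(n_states * n_actions, n_states)
    b = r.reshape(-1)
    return A, b

rng = np.random.default_rng(7)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
A, b = alp_constraints(P, r, gamma)

v_star = np.zeros(n_states)
for _ in range(2000):                                     # value iteration for v*
    v_star = (r + gamma * P @ v_star).max(axis=1)

# Constraint slack at v*: non-negative, and zero for a greedy action in each state.
slack = (A @ v_star - b).reshape(n_states, n_actions)
```

Since $A\mathbf{1} = (1 - \gamma)\mathbf{1}$, adding a large enough multiple of $\mathbf{1}$ to any representable $v$ makes it feasible, which is why Assumption 10 guarantees feasibility.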

Approximate linear programming is guaranteed to converge to a solution and minimize
a weighted $L_1$ norm on the solution quality.

Theorem 11 (e.g. (de Farias, 2002)). Given Assumption 10, let $\tilde{v}$ be the solution of the
approximate linear program (ALP-$L_1$). If $c = \alpha$ then

$$\|v^* - \tilde{v}\|_{1,\alpha} \le \frac{2}{1 - \gamma} \min_{v \in \mathcal{M}} \|v^* - v\|_\infty.$$

The difficulty with the solution of ALP is that it is hard to derive guarantees on the
policy loss based on the bounds in terms of the $L_1$ norm; it is possible when the objective
function $c$ represents $u$, as Remark 9 shows. In addition, the constant $1/(1 - \gamma)$ may be
very large when $\gamma$ is close to 1.

Approximate linear programs are often formulated in terms of samples instead of the full 
formulation above. The performance guarantees are then based on analyzing the probability 
that a large number of constraints is violated. It is generally hard to translate the constraint 
violation bounds to bounds on the quality of the value function and the policy. 



4. Bilinear Program Formulations

This section shows how to formulate value function approximation as a separable bilinear
program. Bilinear programs are a generalization of linear programs with an additional
bilinear term in the objective function. A separable bilinear program consists of two linear
programs with independent constraints, coupled only through the bilinear term in the
objective; such programs are fairly easy to solve and analyze.

Definition 12 (Separable Bilinear Program). A separable bilinear program in the normal
form is defined as follows:

$$\begin{array}{lll}
\min_{w, x \,|\, y, z} & \multicolumn{2}{l}{s_1^T w + r_1^T x + x^T C y + r_2^T y + s_2^T z} \\
\text{s.t.} & A_1 x + B_1 w = b_1 & A_2 y + B_2 z = b_2 \\
 & w, x \ge \mathbf{0} & y, z \ge \mathbf{0}
\end{array} \qquad (\text{BP-m})$$

The objective of the bilinear program (BP-m) is denoted as $f(w, x, y, z)$. We separate
the variables using a vertical line and the constraints using different columns to emphasize
the separable nature of the bilinear program. In this paper, we only use separable bilinear
programs and refer to them simply as bilinear programs.
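The structure of the objective can be made concrete with a short sketch. It evaluates $f(w, x, y, z)$ and shows that once $(y, z)$ is fixed the objective is linear in $x$ with cost vector $r_1 + Cy$; the function names and the random data are hypothetical.

```python
import numpy as np

def bilinear_objective(w, x, y, z, s1, r1, C, r2, s2):
    """f(w, x, y, z) = s1'w + r1'x + x'Cy + r2'y + s2'z."""
    return s1 @ w + r1 @ x + x @ C @ y + r2 @ y + s2 @ z

def linear_cost_in_x(y, r1, C):
    """With (y, z) fixed, the objective is linear in x with cost vector r1 + C y."""
    return r1 + C @ y

rng = np.random.default_rng(6)
w, x, y, z = rng.random(2), rng.random(3), rng.random(4), rng.random(2)
s1, r1, r2, s2 = rng.random(2), rng.random(3), rng.random(4), rng.random(2)
C = rng.random((3, 4))

full = bilinear_objective(w, x, y, z, s1, r1, C, r2, s2)
without_x = bilinear_objective(w, np.zeros(3), y, z, s1, r1, C, r2, s2)
x_contribution = linear_cost_in_x(y, r1, C) @ x    # equals full - without_x
```

Because each block of variables has its own constraints, fixing one block leaves an ordinary linear program in the other.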

We present three different approximate bilinear formulations that minimize the following
bounds on the approximate value function.

1. Robust policy loss: Minimizes $\|v^* - v_\pi\|_\infty$ by minimizing the bounds in Theorem 7:

$$\min_{\pi \in \Pi} \|v^* - v_\pi\|_\infty \le \min_{v \in \mathcal{M}} \frac{1}{1 - \gamma} \|v - Lv\|_\infty$$

2. Expected policy loss: Minimizes $\|v^* - v_\pi\|_{1,\alpha}$ by minimizing the bounds in Theorem 8:

$$\min_{\pi \in \Pi} \|v^* - v_\pi\|_{1,\alpha} \le \alpha^T v^* + \min_{v \in \mathcal{M}} \left( -\alpha^T v + \frac{1}{1 - \gamma} \|v - Lv\|_\infty \right)$$
$$\min_{\pi \in \Pi} \|v^* - v_\pi\|_{1,\alpha} \le \alpha^T v^* + \min_{v \in \mathcal{M}} \left( -\alpha^T v + \|v - Lv\|_{1,\bar{u}} \right).$$

3. The sum of $k$ largest errors: This formulation represents a hybrid between the robust
and expected formulations. It is more robust than simply minimizing the expected
performance but is not as sensitive to worst-case performance.

The appropriateness of each formulation depends on the particular circumstances of the
domain. For example, minimizing robust bounds is advantageous when the initial
distribution is not known and the performance must be consistent under all circumstances. On the
other hand, minimizing expected bounds on the value function is useful when the initial
distribution is known.

In the formulations described below, we initially assume that samples of all states and
actions are used. This means that the precise version of the operator $L$ is available. To solve
large problems, the number of samples would be much smaller; either simply subsampled
or reduced using the structure of the MDP. Reducing the number of constraints in linear
programs corresponds to simply removing constraints. In approximate bilinear programs it
also reduces the number of some variables, as Section 6 describes.

The formulations below denote the value function approximation generically by $v \in \mathcal{M}$.
That restricts the value functions to be representable using features. Representable value
functions $v$ can be replaced by a set of variables $x$ as $v = \Phi x$. This reduces the number of
variables to the number of features.



4.1 Robust Policy Loss 

The solution of the robust approximate bilinear program minimizes the $L_\infty$ norm of the
Bellman residual $\|v - Lv\|_\infty$. This minimization can be formulated as follows.

$$\begin{array}{lll}
\min_{\pi \,|\, \lambda, \lambda', v} & \multicolumn{2}{l}{\pi^T \lambda + \lambda'} \\
\text{s.t.} & B\pi = \mathbf{1} & Av - b \ge \mathbf{0} \\
 & \pi \ge \mathbf{0} & \lambda + \lambda' \mathbf{1} \ge Av - b \\
 & & \lambda, \lambda' \ge 0 \\
 & & v \in \mathcal{M}
\end{array} \qquad (\text{ABP-}L_\infty)$$

All the variables are vectors except $\lambda'$, which is a scalar. The matrix $A$ represents
constraints that are identical to the constraints in (ALP-$L_1$). The variables $\lambda$ correspond
to all state-action pairs. These variables represent the Bellman residuals that are being
minimized. This formulation offers the following guarantees.

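Theorem 13. Given Assumption 10, any optimal solution $(\tilde{\pi}, \tilde{v}, \tilde{\lambda}, \tilde{\lambda}')$ of the approximate
bilinear program (ABP-$L_\infty$) satisfies:

$$\tilde{\pi}^T \tilde{\lambda} + \tilde{\lambda}' = \|L\tilde{v} - \tilde{v}\|_\infty \le \min_{v \in \mathcal{K} \cap \mathcal{M}} \|Lv - v\|_\infty$$
$$\le 2 \min_{v \in \mathcal{M}} \|Lv - v\|_\infty$$
$$\le 2 (1 + \gamma) \min_{v \in \mathcal{M}} \|v - v^*\|_\infty.$$

Moreover, there exists an optimal solution $\tilde{\pi}$ that is greedy with respect to $\tilde{v}$ for which the
policy loss is bounded by:

$$\|v^* - v_{\tilde{\pi}}\|_\infty \le \frac{2}{1 - \gamma} \left( \min_{v \in \mathcal{M}} \|Lv - v\|_\infty \right).$$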

It is important to note that the theorem states that solving the approximate bilinear
program is equivalent to minimization over all representable value functions, not only the
transitive-feasible ones. This follows by subtracting a constant multiple of the vector $\mathbf{1}$ from $v$
to balance the lower bounds on the Bellman residual error with the upper ones. This reduces the
Bellman residual by 1/2 without affecting the policy. Finally, note that whenever $v^* \in \mathcal{M}$,
both ABP and ALP will return the optimal value function $v^*$.
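The effect of shifting by a constant can be seen numerically. The sketch relies on the identity $(v + k\mathbf{1}) - L(v + k\mathbf{1}) = (v - Lv) + (1 - \gamma) k \mathbf{1}$, which follows from the definition of $L$; the random MDP, the construction of a transitive-feasible $v$, and the choice of the balancing shift are hypothetical details of the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def residual(v):
    """Bellman residual v - Lv."""
    return v - (r + gamma * P @ v).max(axis=1)

# Shift an arbitrary w upward until it is transitive-feasible (residual >= 0).
w = rng.normal(size=n_states)
v = w + max(-residual(w).min(), 0.0) / (1.0 - gamma)
res = residual(v)

# Subtract a constant so the residual becomes symmetric around zero.
k = -(res.max() + res.min()) / (2.0 * (1.0 - gamma))
res_balanced = residual(v + k)

norm_before = np.max(np.abs(res))
norm_after = np.max(np.abs(res_balanced))          # at most half of norm_before

greedy_before = (r + gamma * P @ v).argmax(axis=1)
greedy_after = (r + gamma * P @ (v + k)).argmax(axis=1)
```

The greedy policy is unchanged because the shift adds the same constant to every action value in every state.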

To prove the theorem, we first define the following linear program, which solves for the
$L_\infty$ norm of the Bellman update $L_\pi$ for fixed value function $v$ and policy $\pi$.

$$\begin{array}{ll}
f_1(\pi, v) = \min_{\lambda, \lambda'} & \pi^T \lambda + \lambda' \\
\text{s.t.} & \mathbf{1} \lambda' + \lambda \ge Av - b \\
 & \lambda \ge \mathbf{0}
\end{array} \qquad (6)$$

The linear program (6) corresponds to the bilinear program (ABP-$L_\infty$) with a fixed policy
$\pi$ and value function $v$.

Lemma 14. Let $v \in \mathcal{K}$ be a transitive-feasible value function and let $\pi$ be a policy. Then:

$$f_1(\pi, v) \ge \|v - L_\pi v\|_\infty,$$

with an equality for a deterministic policy $\pi$.

Proof. The dual of the linear program (6) is the following program.

$$\begin{array}{ll}
\max_{x} & x^T (Av - b) \\
\text{s.t.} & x \le \pi \\
 & \mathbf{1}^T x = 1 \\
 & x \ge \mathbf{0}
\end{array} \qquad (7)$$

Note that replacing $\mathbf{1}^T x = 1$ by $\mathbf{1}^T x \le 1$ preserves the properties of the linear program
and would add an additional constraint in (ABP-$L_\infty$): $\lambda' \ge 0$.

First, we show that $f_1(\pi, v) \ge \|L_\pi v - v\|_\infty$. Because $v$ is feasible in the approximate
bilinear program (ABP-$L_\infty$), $Av - b \ge \mathbf{0}$ and $v \ge Lv$ from Lemma 38. Let state $s$ be the
state in which $t = \|L_\pi v - v\|_\infty$ is achieved. That is:

$$t = v(s) - \sum_{a \in \mathcal{A}} \pi(s, a) \left( r(s, a) + \sum_{s' \in \mathcal{S}} \gamma P(s, a, s') v(s') \right).$$

Now let $x(s, a) = \pi(s, a)$ for all $a \in \mathcal{A}$. This is a feasible solution with value $t$, from the
stochasticity of the policy and therefore a lower bound on the objective value.

To show the equality for a deterministic policy $\pi$, we show that $f_1(\pi, v) \le \|L_\pi v - v\|_\infty$,
using that $\pi \in \{0, 1\}$. Then let $x^*$ be an optimal solution of (7). Define the index of $x^*$
with the largest objective value as:

$$i \in \arg\max_{\{i \,:\, x^*(i) > 0\}} (Av - b)(i).$$

Let solution $x'(i) = 1$ and $x'(j) = 0$ for $j \neq i$, which is feasible since $\pi(i) = 1$. In addition:

$$(Av - b)(i) = \|L_\pi v - v\|_\infty.$$

Now $(x^*)^T (Av - b) \le (x')^T (Av - b) = \|L_\pi v - v\|_\infty$, from the fact that $i$ is the index of the
largest element of the objective function. □
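Lemma 14 can be illustrated without a linear programming solver. For a fixed $\lambda'$ the best $\lambda$ in (6) is $\max(Av - b - \lambda' \mathbf{1}, 0)$, so the objective is convex and piecewise linear in the scalar $\lambda'$ and its minimum is attained at one of the breakpoints. The sketch below uses this observation, which is an assumption of the example rather than a technique stated in the text; the random MDP and all names are hypothetical.

```python
import numpy as np

def f1(pi_vec, residual):
    """Value of program (6): minimize pi'lambda + lambda' over the breakpoints of lambda'."""
    candidates = np.unique(residual)
    return min(t + pi_vec @ np.maximum(residual - t, 0.0) for t in candidates)

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

# Build a transitive-feasible v by shifting an arbitrary w upward by a constant.
w = rng.normal(size=n_states)
gap = (r + gamma * P @ w).max(axis=1) - w                 # Lw - w
v = w + max(gap.max(), 0.0) / (1.0 - gamma)

res = v[:, None] - (r + gamma * P @ v)                    # (Av - b) arranged as [s, a]
residual = res.reshape(-1)                                # state-major order

policy = rng.integers(n_actions, size=n_states)           # a deterministic policy
pi_det = np.eye(n_actions)[policy].reshape(-1)
pi_unif = np.full(n_states * n_actions, 1.0 / n_actions)  # a stochastic policy

norm_det = np.max(res[np.arange(n_states), policy])       # ||v - L_pi v||_inf
norm_unif = np.max(res.mean(axis=1))                      # same norm, uniform policy
```

For the deterministic policy the two quantities coincide, while for the uniform policy `f1` is only an upper bound, as the lemma states.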

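When the policy $\pi$ is fixed, the approximate bilinear program (ABP-$L_\infty$) becomes the
following linear program:

$$\begin{array}{ll}
f_2(\pi) = \min_{\lambda, \lambda', v} & \pi^T \lambda + \lambda' \\
\text{s.t.} & Av - b \ge \mathbf{0} \\
 & \lambda + \lambda' \mathbf{1} \ge Av - b \\
 & \lambda \ge \mathbf{0} \\
 & v \in \mathcal{M}
\end{array} \qquad (8)$$

Using Lemma 38, this linear program corresponds to:

$$f_2(\pi) = \min_{v \in \mathcal{M} \cap \mathcal{K}} f_1(\pi, v).$$

Then it is easy to show that: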

Lemma 15. Given a policy $\pi$, let $\tilde{v}$ be an optimal solution of the linear program (8). Then:

$$f_2(\pi) = \|L_\pi \tilde{v} - \tilde{v}\|_\infty \le \min_{v \in \mathcal{M} \cap \mathcal{K}} \|L_\pi v - v\|_\infty.$$

When $v$ is fixed, the approximate bilinear program (ABP-$L_\infty$) becomes the following
linear program:

$$\begin{array}{ll}
f_3(v) = \min_{\pi} & f_1(\pi, v) \\
\text{s.t.} & B\pi = \mathbf{1} \\
 & \pi \ge \mathbf{0}
\end{array} \qquad (9)$$

Note that the program is only meaningful if $v$ is transitive-feasible and that the function
$f_1$ corresponds to a minimization problem.

Lemma 16. Let $v \in \mathcal{M} \cap \mathcal{K}$ be a transitive-feasible value function. There exists an optimal
solution $\tilde{\pi}$ of the linear program (9) such that:

1. $\tilde{\pi}$ represents a deterministic policy

2. $L_{\tilde{\pi}} v = Lv$

3. $\|L_{\tilde{\pi}} v - v\|_\infty = \min_{\pi \in \Pi} \|L_\pi v - v\|_\infty = \|Lv - v\|_\infty$

Proof. The existence of an optimal $\tilde{\pi}$ that corresponds to a deterministic policy follows
from Lemma 15, the correspondence between policies and values $\pi$, and the existence of a
deterministic greedy policy.

Since $v \in \mathcal{K}$, we have for some policy $\hat{\pi}$ that:

$$v \ge Lv = L_{\hat{\pi}} v \ge L_{\tilde{\pi}} v.$$

Assuming that $L_{\tilde{\pi}} v < Lv$ leads to a contradiction since $\hat{\pi}$ is also a feasible solution in the
linear program (9) and:

$$v - L_{\tilde{\pi}} v > v - Lv$$
$$\|v - L_{\tilde{\pi}} v\|_\infty > \|v - Lv\|_\infty.$$

This proves the lemma. □
Theorem 13 now easily follows from the lemmas above. 

Proof. Let $\bar{v}$ be a value function with the minimal $\|L\bar{v} - \bar{v}\|_\infty$ feasible in the approximate
bilinear program (ABP-$L_\infty$), and let $\bar{\pi}$ be a greedy policy with respect to $\bar{v}$. Because $\bar{v} \ge v^*$, as
Lemma 38 shows, we get:

$$t = \|L\bar{v} - \bar{v}\|_\infty = \|L_{\bar{\pi}} \bar{v} - \bar{v}\|_\infty.$$

Let $f^*$ be the optimal objective value of (ABP-$L_\infty$). Because both $\bar{v}$ and $\bar{\pi}$ are feasible in
(ABP-$L_\infty$), we have that $f^* \le t$. Now, assume that $\tilde{v}$ is an optimal solution of (ABP-$L_\infty$)
with an objective value $\tilde{f} = \|L\tilde{v} - \tilde{v}\|_\infty > t$. Then, from Lemma 16, $\tilde{f} > t \ge f^*$, which
contradicts the optimality of $\tilde{v}$.

To show that the optimal policy is deterministic and greedy, let $\pi^*$ be the optimal policy.
Then consider the state $s$ for which $\pi^*$ does not define a deterministic greedy action. From
the definition of greedy action $\bar{a}$:

$$(L_{\bar{a}} v)(s) \le (L_a v)(s).$$

From the bilinear formulation (ABP-$L_\infty$), it is easy to show that there is an optimal solution
such that:

$$(L_a v)(s) \le \lambda' + \lambda(s, a)$$
$$\lambda(s, \bar{a}) \le \lambda(s, a).$$

Then setting $\pi(s, \bar{a}) = 1$ and all other action probabilities to 0, the difference in the objective
value function is:

$$\lambda(s, \bar{a}) - \sum_{a \in \mathcal{A}} \pi(s, a) \lambda(s, a) \le 0.$$

Therefore, the objective function for the deterministic greedy policy does not increase. 
The remainder of the theorem follows directly from Proposition 40, Proposition 41, and 
Proposition 42. The bounds on the policy loss then follow directly from Theorem 7. □ 

4.2 Expected Policy Loss

This section describes bilinear programs that minimize expected policy loss for a given initial
distribution $\|v - Lv\|_{1,\alpha}$. The initial distribution can be used to derive tighter bounds on
the policy loss. We describe two formulations. They respectively minimize an $L_\infty$ and a
weighted $L_1$ norm on the Bellman residual.

The expected policy loss can be minimized by solving the following bilinear formulation. 

min 7r"'"A + A' — (1 — ■y)a^v 

IT I X,X',V 

s.t. Btt = 1 Av-b>0 

TT > X + X'l>Av-b (ABP-Li) 

A, A' > 

V eM 



Notice that this formulation is identical to the bilinear program (ABP-Loo) with the 
exception of the term —(1 — 7)0;^^. 

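Theorem 17. Given Assumption 10, any optimal solution $(\tilde\pi, \tilde v, \tilde\lambda, \tilde\lambda')$ of the approximate
bilinear program (ABP-$L_1$) satisfies:

$$
\begin{aligned}
\frac{1}{1-\gamma}\left(\tilde\pi^T\tilde\lambda + \tilde\lambda'\right) - \alpha^T\tilde v
 &= \frac{1}{1-\gamma}\|L\tilde v - \tilde v\|_\infty - \alpha^T\tilde v \\
 &\le \min_{v\in\mathcal{K}\cap\mathcal{M}} \left(\frac{1}{1-\gamma}\|Lv - v\|_\infty - \alpha^T v\right) \\
 &\le \min_{v\in\mathcal{M}} \left(\frac{1}{1-\gamma}\|Lv - v\|_\infty - \alpha^T v\right).
\end{aligned}
$$

Moreover, there exists an optimal solution $\tilde\pi$ that is greedy with respect to $\tilde v$ for which the
policy loss is bounded by:

$$\|v^* - v_{\tilde\pi}\|_{1,\alpha} \le \frac{2}{1-\gamma}\left(\min_{v\in\mathcal{M}} \frac{1}{1-\gamma}\|Lv - v\|_\infty - \|v^* - v\|_{1,\alpha}\right).$$

Notice that the bound in this theorem is tighter than the one in Theorem 13. In
particular, $\|v^* - \tilde v\|_{1,\alpha} > 0$, unless the solution of the ABP is the optimal value function.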



Proof. The proof of the theorem is almost identical to the proof of Theorem 13 with two
main differences. First, the objective function of (ABP-$L_1$) is insensitive to adding a
constant to the value function:

$$\frac{1}{1-\gamma}\|(v + k\mathbf{1}) - L(v + k\mathbf{1})\|_\infty - \alpha^T(v + k\mathbf{1}) = \frac{1}{1-\gamma}\|v - Lv\|_\infty - \alpha^T v.$$

Hence the missing factor 2 when going from minimization over $\mathcal{K}\cap\mathcal{M}$ to minimization over
$\mathcal{M}$. The second difference is in the derivation of the bound on the policy loss, which follows
directly from Theorem 8. □
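
The insensitivity to constant shifts can be checked term by term. The following is a sketch under stated assumptions that are not spelled out in the proof itself: $v$ is transitive-feasible ($v \ge Lv$), $k \ge 0$, the initial distribution sums to one ($\alpha^T\mathbf{1} = 1$), and the objective is taken in the form scaled by $1/(1-\gamma)$, as in the statement of Theorem 17.

```latex
\begin{align*}
L(v + k\mathbf{1}) &= Lv + \gamma k\mathbf{1}
  && \text{(each row of the transition matrix sums to 1)} \\
(v + k\mathbf{1}) - L(v + k\mathbf{1}) &= (v - Lv) + (1-\gamma)k\mathbf{1} \\
\|(v + k\mathbf{1}) - L(v + k\mathbf{1})\|_\infty &= \|v - Lv\|_\infty + (1-\gamma)k
  && \text{(since } v - Lv \ge 0 \text{ and } k \ge 0\text{)} \\
\frac{1}{1-\gamma}\|(v + k\mathbf{1}) - L(v + k\mathbf{1})\|_\infty - \alpha^T(v + k\mathbf{1})
  &= \frac{1}{1-\gamma}\|v - Lv\|_\infty + k - \alpha^T v - k .
\end{align*}
```

The two occurrences of $k$ cancel, so shifting $v$ by a non-negative constant leaves the scaled objective unchanged.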

The bilinear program formulation in (ABP-$L_1$) can be further strengthened when an
upper bound on the state-visitation frequencies is available.

$$
\begin{aligned}
\min_{\pi \,|\, \lambda, v} \quad & \pi^T U\lambda - \alpha^T v \\
\text{s.t.} \quad & B\pi = \mathbf{1}, \qquad Av - b \ge 0 \\
& \pi \ge 0, \qquad \lambda \ge Av - b \\
& \lambda \ge 0 \\
& v \in \mathcal{M}
\end{aligned}
\tag{ABP-$U$}
$$

Here $U : |S|\cdot|A| \times |S|\cdot|A|$ is a matrix that maps a policy to bounds on state-action
visitation frequencies. It must satisfy that:

$$\pi(s,a) = 0 \;\Rightarrow\; (\pi^T U)(s,a) = 0 \qquad \forall s \in S \;\; \forall a \in A.$$

Remark 18. One simple option is to have $U$ represent a diagonal matrix of $u$, where $u$ is
the bound for all policies $\pi \in \Pi$. That is:

$$U((s,a),(s',a')) = \begin{cases} u(s) & \text{if } (s,a) = (s',a') \\ 0 & \text{otherwise} \end{cases} \qquad \forall s,s' \in S,\; a,a' \in A.$$

The formal guarantees for this formulation are as follows. 

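Theorem 19. Given Assumption 10 and that for all $\pi \in \Pi$: $\sum_{a \in A} (\pi^T U)(s,a) \ge u_\pi(s)$,
any optimal solution $(\tilde\pi, \tilde v, \tilde\lambda, \tilde\lambda')$ of the bilinear program (ABP-$U$) then satisfies:

$$\tilde\pi^T U\tilde\lambda - \alpha^T\tilde v = \|\tilde v - L\tilde v\|_{u} - \alpha^T\tilde v \le \min_{v\in\mathcal{K}\cap\mathcal{M}} \left( \|v - Lv\|_{u} - \alpha^T v \right).$$

Assuming that $U$ is defined as in Remark 18, there exists an optimal solution $\tilde\pi$ that is
greedy with respect to $\tilde v$ and:

$$\|v^* - v_{\tilde\pi}\|_{1,\alpha} \le \min_{v\in\mathcal{K}\cap\mathcal{M}} \left( \|v - Lv\|_{1,u(v)} - \|v^* - v\|_{1,\alpha} \right).$$

Here, $u(v)$ represents an upper bound on the state-action visitation frequencies for a policy
greedy with respect to value function $v$.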

Unlike Theorem 13 and Theorem 17, the bounds in this theorem do not guarantee that 
the solution quality does not degrade by restricting the value function to be transitive- 
feasible. 

To prove the theorem we first define the following linear program that solves for the $L_1$
norm of the Bellman update $L_\pi$ for a value function $v$.

$$
\begin{aligned}
f_1(\pi, v) = \min_{\lambda} \quad & \pi^T U\lambda \\
\text{s.t.} \quad & \lambda \ge Av - b \\
& \lambda \ge 0
\end{aligned}
\tag{10}
$$

The linear program (10) corresponds to the bilinear program (ABP-$U$) with a fixed policy
$\pi$ and value function $v$. Notice, in particular, that $\alpha^T v$ is a constant.

Lemma 20. Let value function $v$ be feasible in the bilinear program (ABP-$U$), and let $\pi$
be an arbitrary policy. Then:

$$f_1(\pi, v) \ge \|L_\pi v - v\|_{1,u},$$

with an equality for a deterministic policy.

Proof. The dual of the linear program (10) is the following.

$$
\begin{aligned}
\max_{x} \quad & x^T(Av - b) \\
\text{s.t.} \quad & x \le U^T\pi \\
& x \ge 0
\end{aligned}
\tag{11}
$$

We have that $f_1(\pi, v) \ge \|L_\pi v - v\|_{1,u}$ since $x = U^T\pi$ is a feasible solution. To show the
equality for a deterministic policy $\pi$, let $x^*$ be an optimal solution of linear program (11).
Since $Av \ge b$ and $U$ is non-negative, an optimal solution satisfies $x^* = U^T\pi$. The optimal
value of the linear program thus corresponds to the definition of the weighted $L_1$ norm. □

The proof of Theorem 19 is similar to the proof of Theorem 13, but uses Theorem 8
instead of Theorem 7 to bound the policy loss. The existence of a deterministic and greedy
optimal solution $\tilde\pi$ also follows as in Theorem 13, omitting $\lambda'$ and weighting $\lambda$ by $u$.



4.3 Hybrid Formulation 

While the robust bilinear formulation (ABP-$L_\infty$) guarantees to minimize the robust
approximation error, it may be overly pessimistic. The bilinear program (ABP-$U$), on the
other hand, optimizes the average performance, but does not provide strong guarantees. It
is possible to combine the advantages (and disadvantages) of these programs using a hybrid
formulation. The hybrid formulation minimizes the hybrid norm of the Bellman residual,
defined as:

$$\|x\|_{k,c} = \max_{\{y \,|\, \mathbf{1}^T y = k,\; \mathbf{1} \ge y \ge 0\}} \sum_{i=1}^{n} y(i)\,c(i)\,|x(i)|,$$

where $n$ is the length of vector $x$ and $c \ge 0$. It is easy to show that this norm represents the
$c$-weighted $L_1$ norm of the $k$ largest components of the vector. As such, it is more robust
than the plain $L_1$ norm, but is not as sensitive to outliers as the $L_\infty$ norm. Notice that the
solution may be fractional when $k$ is not an integer; that is, some elements are counted only partially.
The bilinear program that minimizes the hybrid norm is defined as follows.

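$$
\begin{aligned}
\min_{\pi \,|\, \lambda, \lambda', v} \quad & \pi^T U\lambda + k\lambda' \\
\text{s.t.} \quad & B\pi = \mathbf{1}, \qquad Av - b \ge 0 \\
& \pi \ge 0, \qquad \lambda + \lambda' U^{-1}\mathbf{1} \ge Av - b \\
& \lambda, \lambda' \ge 0 \\
& v \in \mathcal{M}
\end{aligned}
\tag{ABP-h}
$$

Here $U$ is a matrix that maps a policy to bounds on state-action visitation frequencies, for
example, as defined in Remark 18.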
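
The hybrid norm defined above can be evaluated without solving a linear program, because the maximizing $y$ places weight 1 on the largest weighted components and a fractional weight on the next one. The following is a minimal sketch of that greedy evaluation; the function name `hybrid_norm` and its argument checks are illustrative assumptions, not part of the paper.

```python
def hybrid_norm(x, c, k):
    """Hybrid norm ||x||_{k,c}: c-weighted L1 norm of the k largest components.

    Solves max_y sum_i y[i] * c[i] * |x[i]| subject to sum(y) = k and
    0 <= y <= 1 greedily: full weight on the largest terms, and a
    fractional weight on the next term when k is not an integer.
    """
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    if any(ci < 0 for ci in c):
        raise ValueError("weights c must be non-negative")
    if not 0 <= k <= len(x):
        raise ValueError("k must lie between 0 and len(x)")
    # Weighted magnitudes, largest first.
    terms = sorted((ci * abs(xi) for xi, ci in zip(x, c)), reverse=True)
    total, remaining = 0.0, float(k)
    for term in terms:
        weight = min(1.0, remaining)
        if weight <= 0.0:
            break
        total += weight * term
        remaining -= weight
    return total
```

With unit weights, `k = 1` recovers the $L_\infty$ norm and `k = len(x)` recovers the $L_1$ norm, which matches the interpolation described in the text.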

Theorem 21. Given Assumption 10 and $U$ that is defined as in Remark 18, any optimal
solution $(\tilde\pi, \tilde v, \tilde\lambda, \tilde\lambda')$ of (ABP-h) then satisfies:

$$\tilde\pi^T U\tilde\lambda + k\tilde\lambda' = \|L\tilde v - \tilde v\|_{k,u(\tilde v)} \le \min_{v\in\mathcal{K}\cap\mathcal{M}} \|Lv - v\|_{k,u(v)}.$$

Here, $u(v)$ represents the upper bound on the state-action visitation frequencies for a policy
greedy with respect to value function $v$.

The implication of these bounds on the policy loss is beyond the scope of this paper,
but it is likely that some form of policy loss bounds can be developed.

The proof of the theorem is almost identical to the proof of Theorem 13. We
first define the following linear program, which solves for the required norm of the Bellman
update for value function $v$ and policy $\pi$.

$$
\begin{aligned}
h(\pi, v) = \min_{\lambda, \lambda'} \quad & \pi^T U\lambda + k\lambda' \\
\text{s.t.} \quad & \lambda' U^{-1}\mathbf{1} + \lambda \ge Av - b \\
& \lambda, \lambda' \ge 0
\end{aligned}
\tag{12}
$$

The linear program (12) corresponds to the bilinear program (ABP-h) with a fixed policy
$\pi$ and value function $v$.

Lemma 22. Let $v \in \mathcal{K}$ be a transitive-feasible value function, let $\pi$ be a policy, and let $U$
be defined as in Remark 18. Then:

$$h(\pi, v) \ge \|v - L_\pi v\|_{k,u},$$

with an equality for a deterministic policy $\pi$.

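Proof. The dual of the linear program (12) is the following.

$$
\begin{aligned}
\max_{x} \quad & x^T(Av - b) \\
\text{s.t.} \quad & x \le U^T\pi \\
& \mathbf{1}^T (U^{-1})^T x \le k \\
& x \ge 0
\end{aligned}
\tag{13}
$$

First, change the variables in the linear program to $x = U^T z$ to get:

$$
\begin{aligned}
\max_{z} \quad & z^T U(Av - b) \\
\text{s.t.} \quad & z \le \pi \\
& \mathbf{1}^T z \le k \\
& z \ge 0
\end{aligned}
\tag{14}
$$

using the fact that $U$ is diagonal and positive.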

The norm $\|L_\pi v - v\|_{k,c}$ can be expressed as the following linear program:

$$
\begin{aligned}
\max_{y} \quad & y^T XU(Av - b) \\
\text{s.t.} \quad & y \le \mathbf{1} \\
& \mathbf{1}^T y \le k \\
& y \ge 0
\end{aligned}
\tag{15}
$$

Here, the matrix $X : |S| \times |S|\cdot|A|$ selects the subsets of the Bellman residuals that
correspond to the policy, as defined:

$$X(s, (s', a')) = \begin{cases} \pi(s', a') & \text{when } s = s' \\ 0 & \text{otherwise.} \end{cases}$$

It is easy to show that $v - L_\pi v = X(Av - b)$. Note that $XU = UX$ from the definition of $U$.

Clearly, when $\pi \in \{0, 1\}$ is deterministic, the linear programs (14) and (15) are identical.
When the policy $\pi$ is stochastic, assume an optimal solution $y$ of (15) and let $z = X^T y$.
Then, $z$ is feasible in (14) with the identical objective value, which shows the inequality. □

5. Solving Bilinear Programs 

This section describes methods for solving approximate bilinear programs. Bilinear pro- 
grams can be easily mapped to other global optimization problems, such as mixed integer 
linear programs (Horst and Tuy, 1996). We focus on a simple iterative algorithm for solving 
bilinear programs approximately, which also serves as a basis for many optimal algorithms. 

Solving a bilinear program is an NP-complete problem (Bennett and Mangasarian, 
1992). The membership in NP follows from the finite number of basic feasible solutions 
of the individual linear programs, each of which can be checked in polynomial time. The 
NP-hardness is shown by a reduction from the SAT problem. 

There are two main approaches to solving bilinear programs optimally. In the first 
approach, a relaxation of the bilinear program is solved. The solution of the relaxed problem 
represents a lower bound on the optimal solution. The relaxation is then iteratively refined, 
for example by adding cutting plane constraints, until the solution becomes feasible. This 
is a common method used to solve integer linear programs. The relaxation of the bilinear 
program is typically either a linear or semi-definite program (Carpara and Monaci, 2009). 

In the second approach, feasible, but suboptimal, solutions of the bilinear program
are calculated approximately. The approximate algorithms are usually some variation of
Algorithm 2. The bilinear program formulation is then refined, using concavity cuts (Horst
and Tuy, 1996), to eliminate previously computed feasible solutions and solved again.
This procedure can be shown to find the optimal solution by eliminating all suboptimal
feasible solutions.

The most common and simplest approximate algorithm for solving bilinear programs
is Algorithm 2. This algorithm is shown for the general bilinear program (BP-m), where
$f(w, x, y, z)$ represents the objective function. The minimizations in the algorithm are linear
programs which can be easily solved. Interestingly, as we will show in Section 7, Algorithm 2
applied to ABP generalizes a version of API.

Algorithm 2: Iterative algorithm for solving (BP-m)

    (x_0, w_0) ← random
    (y_0, z_0) ← arg min_{y,z} f(w_0, x_0, y, z)
    i ← 1
    while y_{i-1} ≠ y_i or x_{i-1} ≠ x_i do
        (y_i, z_i) ← arg min_{y,z | A_2 y + B_2 z = b_2, y,z ≥ 0} f(w_{i-1}, x_{i-1}, y, z)
        (x_i, w_i) ← arg min_{x,w | A_1 x + B_1 w = b_1, x,w ≥ 0} f(w, x, y_i, z_i)
        i ← i + 1
    return f(w_i, x_i, y_i, z_i)

While Algorithm 2 is not guaranteed to find an optimal solution, its empirical performance
is often remarkably good (Mangasarian, 1995). Its basic properties are summarized
by the following proposition.

Proposition 23 (e.g., Bennett and Mangasarian, 1992). Algorithm 2 is guaranteed to
converge, assuming that the linear program solutions are in a vertex of the optimality simplex.
In addition, the global optimum is a fixed point of the algorithm, and the objective
value monotonically improves during execution.

The proof is based on the finite count of the basic feasible solutions of the individual 
linear programs. Because the objective function does not increase in any iteration, the 
algorithm will eventually converge. 
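
The alternating scheme and its convergence behavior can be illustrated on a toy problem. The following is a minimal sketch, not the general (BP-m) solver: it assumes the hypothetical bilinear program $\min_{x,y} x^T C y$ with $x$ and $y$ restricted to probability simplices, so each linear program is solved exactly by picking a vertex (a unit vector). The matrix `C` and all function names are assumptions made for illustration.

```python
def solve_simplex_lp(costs):
    """Minimize costs . p over the probability simplex.

    A vertex optimum puts all mass on one smallest-cost coordinate.
    """
    best = min(range(len(costs)), key=lambda j: costs[j])
    return [1.0 if j == best else 0.0 for j in range(len(costs))]


def objective(C, x, y):
    """Bilinear objective x^T C y."""
    return sum(x[i] * C[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def iterate_bilinear(C, x0, max_iter=100):
    """Alternate between the two linear programs until the iterates repeat."""
    x, y = list(x0), None
    history = []
    for _ in range(max_iter):
        # Fix x and solve the linear program in y (costs are x^T C).
        y_new = solve_simplex_lp(
            [sum(x[i] * C[i][j] for i in range(len(x))) for j in range(len(C[0]))]
        )
        # Fix y and solve the linear program in x (costs are C y).
        x_new = solve_simplex_lp(
            [sum(C[i][j] * y_new[j] for j in range(len(y_new))) for i in range(len(C))]
        )
        history.append(objective(C, x_new, y_new))
        converged = x_new == x and y_new == y
        x, y = x_new, y_new
        if converged:
            break
    return x, y, history
```

On `C = [[3, 1], [0, 4]]`, starting from `x0 = [1, 0]` the iteration stops at objective value 1, while starting from `x0 = [0, 1]` it reaches the global minimum 0. This mirrors the text: the objective never increases and the procedure terminates, but the fixed point depends on the initialization.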

Algorithm 2 can be further refined in case of approximate bilinear programs. For example,
the constraint $v \in \mathcal{M}$ in the bilinear programs serves just to simplify the bilinear
program and a value function that violates it may still be acceptable. The following proposition
motivates the construction of a new value function from two transitive-feasible value
functions.

Proposition 24. Let $v_1$ and $v_2$ be feasible value functions in (ABP-$L_\infty$). Then the value
function

$$v(s) = \min\{v_1(s), v_2(s)\}$$

is also feasible in bilinear program (ABP-$L_\infty$). Therefore $v \ge v^*$ and

$$\|v^* - v\|_\infty \le \min\{\|v^* - v_1\|_\infty, \|v^* - v_2\|_\infty\}.$$

The proof of the proposition is based on Jensen's inequality and is provided in the
appendix. Note that $v$ may have a greater Bellman residual than either $v_1$ or $v_2$.

Proposition 24 can be used to extend Algorithm 2 when solving ABPs. One option is
to take the state-wise minimum of values from multiple random executions of Algorithm 2,
which preserves the transitive feasibility of the value function. However, the increasing
number of value functions used to obtain $v$ also increases the potential sampling error.
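
The state-wise minimum can be illustrated on a small example. The sketch below assumes a hypothetical 2-state, 2-action MDP with $\gamma = 0.5$ (action 0 stays in place, action 1 switches states); the MDP, the helper names, and the two candidate value functions are illustrative assumptions. It checks transitive feasibility ($v \ge Lv$) before and after taking the minimum.

```python
def bellman_update(v, rewards, transitions, gamma):
    """(Lv)(s) = max_a [ r(s,a) + gamma * sum_s' P(s,a,s') * v(s') ]."""
    return [
        max(
            rewards[s][a]
            + gamma * sum(p * v[t] for t, p in enumerate(transitions[s][a]))
            for a in range(len(rewards[s]))
        )
        for s in range(len(v))
    ]


def is_transitive_feasible(v, rewards, transitions, gamma, tol=1e-12):
    """True when v >= Lv holds in every state (up to a small tolerance)."""
    lv = bellman_update(v, rewards, transitions, gamma)
    return all(vs >= ls - tol for vs, ls in zip(v, lv))


def statewise_min(*value_functions):
    """State-wise minimum of several value functions."""
    return [min(values) for values in zip(*value_functions)]


# Hypothetical MDP used only for illustration.
REWARDS = [[1.0, 0.0], [0.0, 2.0]]   # REWARDS[s][a]
TRANSITIONS = [                       # TRANSITIONS[s][a][s']
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
GAMMA = 0.5
```

For `v1 = [3.0, 5.0]` and `v2 = [4.0, 4.0]`, both transitive-feasible in this MDP, `statewise_min(v1, v2)` gives `[3.0, 4.0]`, which is again transitive-feasible and closer to the optimal value function `[2.0, 3.0]` in the maximum norm than either input.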

6. Sampling Guarantees

In most practical problems, the number of states is too large to be explicitly enumerated.
Even though the value function is restricted to be representable, the problem cannot be
solved. The usual approach is to sample a limited number of states, actions, and their
transitions to approximately calculate the value function. This section shows basic properties
of the samples that can provide guarantees of the solution quality with incomplete samples.

First, we show a formal definition of the samples and then show how to use them. The
simplest samples are defined as follows.

Definition 25. One-step simple samples are defined as:

$$\tilde\Sigma \subseteq \{(s, a, (s_1 \ldots s_n), r(s,a)) \;|\; s, s' \in S,\ a \in A\},$$

where $s_1 \ldots s_n$ are selected i.i.d. from the distribution $P(s, a)$.

More informative samples include the full distribution instead of samples from the
distribution. While these samples are often unavailable in practice, they are useful in the
theoretical analysis of sampling issues.

Definition 26. One-step samples with expectation are defined as follows:

$$\bar\Sigma \subseteq \{(s, a, P(s,a), r(s,a)) \;|\; s \in S,\ a \in A\}.$$

Membership of a state in the samples is denoted simply as $s \in \Sigma$ or $(s,a) \in \Sigma$, with the
remaining variables, such as $r(s,a)$, considered to be available implicitly.

The sampling models may vary significantly in different domains. The focus of this work 
is on problems with either a fixed set of available samples or a domain model. Therefore, 
we do not analyze methods for gathering samples. We also do not assume that the samples 
come from previous executions, but rather from a deliberate sample-gathering process. 

The samples are used to approximate the Bellman operator and the set of transitive-feasible
value functions.

Definition 27. The sampled Bellman operator and the corresponding set of sampled
transitive-feasible functions are defined as:

$$(\bar L(v))(s) = \max_{\{a \,|\, (s,a) \in \bar\Sigma\}} r(s,a) + \gamma \sum_{s'} P(s,a,s')\,v(s') \qquad \forall s \in \bar\Sigma \tag{16}$$

$$\bar{\mathcal{K}} = \{v \;|\; \forall (s, a, P(s,a), r(s,a)) \in \bar\Sigma,\ v(s) \ge (\bar L v)(s)\} \tag{17}$$

The less-informative set of samples $\tilde\Sigma$ can be used as follows.

Definition 28. The estimated Bellman operator and the corresponding set of estimated
transitive-feasible functions are defined as:

$$(\tilde L(v))(s) = \max_{\{a \,|\, (s,a) \in \tilde\Sigma\}} r(s,a) + \gamma\,\frac{1}{n} \sum_{i=1}^{n} v(s_i) \qquad \forall s \in \tilde\Sigma \tag{18}$$

$$\tilde{\mathcal{K}} = \{v \;|\; \forall (s, a, (s_1 \ldots s_n), r(s,a)) \in \tilde\Sigma,\ v(s) \ge (\tilde L v)(s)\} \tag{19}$$

Notice that operators $\bar L$ and $\tilde L$ map value functions to a subset of all states: only
states that are sampled. The values for other states are assumed to be undefined.
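
The estimated Bellman operator of Definition 28 can be sketched directly from one-step simple samples. The code below is an illustrative assumption about the data layout (a list of `(s, a, next_states, reward)` tuples with one tuple per sampled state-action pair); the function name is hypothetical. It returns values only for sampled states, matching the remark that other states are undefined.

```python
def estimated_bellman(v, samples, gamma):
    """Apply the estimated Bellman operator to v.

    v:       mapping from state to value.
    samples: iterable of (s, a, next_states, reward), where next_states are
             i.i.d. draws from P(s, a); assumed one tuple per (s, a) pair.
    Returns {s: value} for sampled states only.
    """
    result = {}
    for s, a, next_states, reward in samples:
        if not next_states:
            raise ValueError("each sample needs at least one next state")
        # Sample average replaces the expectation over next states.
        q = reward + gamma * sum(v[t] for t in next_states) / len(next_states)
        # Maximize over the actions sampled for this state.
        if s not in result or q > result[s]:
            result[s] = q
    return result
```

A value function is then in the estimated transitive-feasible set when `v[s] >= estimated_bellman(v, samples, gamma)[s]` for every sampled state `s`.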

The samples can also be used to create an approximation of the initial distribution, or 
the distribution of visitation-frequencies of a given policy. The estimated initial distribution 
a is defined as: 



The existing sampling bounds for approximate linear programming focus on bounding 
the probability that a large number of constraints is violated when assuming a distribution 
over the constraints (de Farias and van Roy, 2004). The difficulty with this approach is that 
the bounds on the number of violated constraints do not transform easily to the bounds on 
the quality of the value function, or the policy. In addition, the constraint distribution is
often somewhat arbitrary because it is difficult to define and sample from the appropriate
distributions.

Our approach, on the other hand, is to define properties of the sampled operators
that guarantee that the sampling error bounds are small. These bounds do not rely on
distributions over constraints and transform directly to bounds on the policy loss. To
define bounds on the sampling behavior, we propose the following assumptions. The first
assumption limits the error due to missing transitions in the sampled Bellman operator $\bar L$.

Assumption 29 (Constraint Sampling Behavior). For all representable value functions
$v \in \mathcal{M}$:

$$\mathcal{K} \subseteq \bar{\mathcal{K}} \subseteq \mathcal{K}(\epsilon_p).$$

The second assumption quantifies the error on the estimation of the transitions of the
estimated Bellman operator $\tilde L$.

Assumption 30 (Constraint Estimation Behavior). For all representable value functions
$v \in \mathcal{M}$ the following holds:

$$\bar{\mathcal{K}}(-\epsilon_s) \subseteq \tilde{\mathcal{K}} \subseteq \bar{\mathcal{K}}(\epsilon_s).$$

These assumptions are intentionally made generic so that they apply to a wide range
of scenarios. Domain-specific assumptions are likely to lead to much tighter bounds, but
these are beyond the scope of this paper.

Although we define the sampled Bellman operator directly, in practice only its approximate
version is typically estimated. The direct definitions are given only for the sake of
theoretical analysis. The sampled matrices used in bilinear program (ABP-$L_\infty$) are defined
as follows for all $(s_i, a_j) \in \tilde\Sigma$.

$$B(s', (s_i, a_j)) = \mathbb{1}\{s' = s_i\} \qquad \forall s' \in \tilde\Sigma$$

The ordering over states in the definitions above is also assumed to be consistent. The
sampled version of the bilinear program (ABP-$L_\infty$) is then:

$$
\begin{aligned}
\min_{\pi \,|\, \lambda, \lambda', x} \quad & \pi^T\lambda + \lambda' \\
\text{s.t.} \quad & B\pi = \mathbf{1}, \qquad A\Phi x - b \ge 0 \\
& \pi \ge 0, \qquad \lambda + \lambda'\mathbf{1} \ge A\Phi x - b \\
& \lambda, \lambda' \ge 0
\end{aligned}
\tag{s-ABP-$L_\infty$}
$$

The size of the bilinear program (s-ABP-$L_\infty$) scales with the number of samples and
features, not with the size of the full MDP, because the variables $\lambda$ and $\pi$ are defined only
for state-action pairs in $\tilde\Sigma$. That is, $|\pi| = |\lambda| = |\{(s,a) \in \tilde\Sigma\}|$. The number of constraints
in (s-ABP-$L_\infty$) is approximately three times the number of variables $\lambda$. Finally, the number
of variables $x$ corresponds to the number of approximation features.

Theorem 13 shows that sampled robust ABP minimizes $\|v - \bar Lv\|_\infty$ or $\|v - \tilde Lv\|_\infty$,
depending on the samples used. It is then easy to derive sampling bounds that rely on the
sampling assumptions defined above.

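Theorem 31. Let the optimal solutions to the sampled and precise Bellman residual
minimization problems be:

$$v_1 \in \arg\min_{v\in\mathcal{M}\cap\mathcal{K}} \|v - Lv\|_\infty \qquad v_2 \in \arg\min_{v\in\mathcal{M}\cap\bar{\mathcal{K}}} \|v - \bar Lv\|_\infty \qquad v_3 \in \arg\min_{v\in\mathcal{M}\cap\tilde{\mathcal{K}}} \|v - \tilde Lv\|_\infty$$

Value functions $v_1$, $v_2$, $v_3$ correspond to solutions of instances of robust approximate bilinear
programs for the given samples. Also let $\hat v_i = v_{\pi_i}$, where $\pi_i$ is greedy with respect to $v_i$. Then,
given Assumptions 10, 29, and 30, the following holds:

$$\|v^* - \hat v_1\|_\infty \le \frac{2}{1-\gamma} \min_{v\in\mathcal{M}} \|v - Lv\|_\infty$$

$$\|v^* - \hat v_2\|_\infty \le \frac{2}{1-\gamma} \left( \min_{v\in\mathcal{M}} \|v - Lv\|_\infty + \epsilon_p \right)$$

$$\|v^* - \hat v_3\|_\infty \le \frac{2}{1-\gamma} \left( \min_{v\in\mathcal{M}} \|v - Lv\|_\infty + \epsilon_p + 2\epsilon_s \right)$$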

These bounds show that it is possible to bound policy loss due to incomplete samples. 
As mentioned above, existing bounds on constraint violation in approximate linear pro- 
gramming (de Farias and van Roy, 2004) typically do not easily lead to policy loss bounds. 

Sampling guarantees for other bilinear program formulations are very similar. Because 
they also rely on an approximation of the initial distribution and the policy loss, they require 
additional assumptions on uniformity of state-samples. 

Proof. We show bounds on $\|v_i - Lv_i\|_\infty$; the remainder of the theorem follows directly from
Theorem 13. The second inequality follows from Assumption 29 and Lemma 36, as follows:

$$
\begin{aligned}
v_2 - Lv_2 &\le v_2 - \bar Lv_2 \\
&\le v_1 - \bar Lv_1 \\
&\le v_1 - Lv_1 + \epsilon_p\mathbf{1}.
\end{aligned}
$$

The third inequality follows from Assumptions 29, 30 and Lemma 36, as follows:

$$
\begin{aligned}
v_3 - Lv_3 &\le v_3 - \bar Lv_3 + \epsilon_p\mathbf{1} \\
&\le v_3 - \tilde Lv_3 + \epsilon_s\mathbf{1} + \epsilon_p\mathbf{1} \\
&\le v_1 - \tilde Lv_1 + \epsilon_s\mathbf{1} + \epsilon_p\mathbf{1} \\
&\le v_1 - Lv_1 + 2\epsilon_s\mathbf{1} + \epsilon_p\mathbf{1}.
\end{aligned}
$$

Here, we use the fact that $v_i \ge Lv_i$ and that the $v_i$ minimize the corresponding Bellman
residuals. □



To summarize, this section identifies basic assumptions on the sampling behavior and 
shows that approximate bilinear programming scales well in the face of uncertainty caused 
by incomplete sampling. More detailed analysis will need to focus on identifying problem- 
specific assumptions and sampling modes that guarantee the basic conditions, namely sat- 
isfying Assumption 30 and Assumption 29. Such analysis is beyond the scope of this paper. 



7. Discussion and Related ADP Methods 

This section describes connections between approximate bilinear programming and two
closely related approximate dynamic programming methods: ALP and $L_\infty$-API, which are
commonly used to solve factored MDPs (Guestrin et al., 2003). Our analysis sheds light
on some of their observed properties and leads to a new convergent form of approximate
policy iteration.

Approximate bilinear programming addresses some important issues with ALP: 1) ALP
provides value function bounds with respect to the $L_1$ norm, which does not guarantee small
policy loss, 2) ALP's solution quality depends significantly on the heuristically-chosen
objective function $c$ in (ALP-$L_1$) (de Farias, 2002), 3) the performance bounds involve a
constant $1/(1-\gamma)$, which can be very large when $\gamma$ is close to 1, and 4) incomplete
constraint samples in ALP easily lead to unbounded linear programs. The drawback of using
approximate bilinear programming, however, is the higher computational complexity.

The first and the second issues in ALP can be addressed by choosing a problem-specific
objective function $c$ (de Farias, 2002). Unfortunately, all existing bounds require that
$c$ is chosen based on the optimal ALP solution for $c$. This is impossible to compute in
practice. Heuristic values for $c$ are used instead. The robust approximate bilinear program
(ABP-$L_\infty$), on the other hand, has no such parameters. The expected-loss
bilinear program (ABP-$U$), in turn, can be seen as a method for simultaneously optimizing $c$
and the approximate linear program.

The fourth issue in approximate linear programs arises when the constraints need to be
sampled. The ALP may become unbounded with incomplete samples because its objective
value is defined using the $L_1$ norm on the value function, and the constraints are defined
using the $L_\infty$ norm of the Bellman residual. In approximate bilinear programs, the Bellman
residual is used in both the constraints and the objective function. The objective function of
ABP is then bounded below by 0 for an arbitrarily small number of samples.

The NP-completeness of ABP compares unfavorably with the polynomial complexity
of ALP. However, most other approximate dynamic programming algorithms are not guaranteed
to converge to a solution in finite time. As we show below, the exponential time
complexity of ABP is unavoidable (unless P = NP).

The following theorem shows that the computational complexity of the ABP formulation 
is asymptotically the same as the complexity of tightly approximating the value function. 

Theorem 32. Assume $0 < \gamma < 1$, and a given $\epsilon > 0$. Then it is NP-complete to determine:

$$\min_{v\in\mathcal{K}\cap\mathcal{M}} \|Lv - v\|_\infty < \epsilon \qquad \min_{v\in\mathcal{M}} \|Lv - v\|_\infty < \epsilon.$$

The problem remains NP-complete when Assumption 10 is satisfied. It is also NP-complete
to determine:

$$\min_{v\in\mathcal{M}} \|Lv - v\|_\infty - \|v^* - v\|_{1,\alpha} < \epsilon \qquad \min_{v\in\mathcal{M}} \|Lv - v\|_{1,u} - \|v^* - v\|_{1,\alpha} < \epsilon,$$

assuming that $u$ is defined as in Remark 18.

As the theorem states, the value function approximation does not become computationally
simpler even when Assumption 10 holds, a universal assumption in the paper. Notice
that ALP can determine whether $\min_{v\in\mathcal{K}\cap\mathcal{M}} \|Lv - v\|_\infty = 0$ in polynomial time.

The proof of Theorem 32 is based on a reduction from SAT and can be found in Sec- 
tion A.2. The policy in the reduction determines the true literal in each clause, and the 
approximate value function corresponds to the truth value of the literals. The approxima- 
tion basis forces literals that share the same variable to have consistent values. 

Approximate bilinear programming can also improve on API with $L_\infty$ minimization
($L_\infty$-API for short), which is a leading method for solving factored MDPs (Guestrin et al.,
2003). Minimizing the $L_\infty$ approximation error is theoretically preferable, since it is
compatible with the existing bounds on policy loss (Guestrin et al., 2003). The bounds on value
function approximation in API are typically (Munos, 2003):

$$\limsup_{k\to\infty} \|v^* - \hat v_k\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k\to\infty} \|\tilde v_k - v_k\|_\infty.$$

These bounds are looser than the bounds on solutions of ABP by at least a factor of $1/(1-\gamma)$.
Often the difference may be up to $1/(1-\gamma)^2$ since the error $\|\tilde v_k - v_k\|_\infty$ may be significantly
larger than $\|\tilde v_k - L\tilde v_k\|_\infty$. Finally, the bounds cannot be easily used, because they only hold
in the limit.

We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API.
OAPI is shown in Algorithm 1, where $Z(\pi)$ is calculated using the following program:

$$
\begin{aligned}
\min_{\phi, v} \quad & \phi \\
\text{s.t.} \quad & Av \ge b \quad \left(\equiv (I - \gamma P_\pi)v \ge r_\pi \;\; \forall \pi \in \Pi\right) \\
& -(I - \gamma P_\pi)v + \mathbf{1}\phi \ge -r_\pi \\
& v \in \mathcal{M}
\end{aligned}
\tag{20}
$$

In fact, OAPI corresponds to Algorithm 2 applied to ABP because the linear program (20)
corresponds to (ABP-$L_\infty$) with a fixed $\pi$ (see (8)). Then, using Proposition 23, we get the
following corollary.

Corollary 33. Optimistic approximate policy iteration converges in finite time. In addition,
the Bellman residual of the generated value functions monotonically decreases.

OAPI differs from $L_\infty$-API in two ways: 1) OAPI constrains the Bellman residuals by 0
from below and by $\phi$ from above, and then it minimizes $\phi$. $L_\infty$-API constrains the Bellman
residuals by $\phi$ from both above and below. 2) OAPI, like API, uses only the current policy
for the upper bound on the Bellman residual, but uses all the policies for the lower bound
on the Bellman residual.

$L_\infty$-API cannot return an approximate value function that has a lower Bellman residual
than ABP, given the optimality of ABP described in Theorem 13. However, even OAPI, an
approximate ABP algorithm, performs comparably to $L_\infty$-API, as the following theorem
states.

Theorem 34. Assume that $L_\infty$-API converges to a policy $\bar\pi$ and a value function $\bar v$ that
both satisfy: $\bar\phi = \|\bar v - L_{\bar\pi}\bar v\|_\infty = \|\bar v - L\bar v\|_\infty$. Then

$$\tilde v = \bar v + \frac{\bar\phi}{1-\gamma}\mathbf{1}$$

is feasible in the bilinear program (ABP-$L_\infty$), and it is a fixed point of OAPI. In addition,
the greedy policies with respect to $\tilde v$ and $\bar v$ are identical.

Notice that while the optimistic and standard policy iterations can converge to the 
same solutions, the steps in their computation may not be identical. The actual results will 
depend on the initialization. 

To prove the theorem, we first consider $L_\infty$-API2 as a modification of $L_\infty$-API.
$L_\infty$-API2 is shown in Algorithm 1, where $Z(\pi)$ is calculated using the following program:

$$
\begin{aligned}
\min_{\phi, v} \quad & \phi \\
\text{s.t.} \quad & (I - \gamma P_a)v + \mathbf{1}\phi \ge r_a \qquad \forall a \in A \\
& -(I - \gamma P_\pi)v + \mathbf{1}\phi \ge -r_\pi \\
& v \in \mathcal{M}
\end{aligned}
\tag{27}
$$

The difference between linear programs (4) and (27) is that (4) involves only the current
policy, while (27) bounds $(I - \gamma P_a)v + \mathbf{1}\phi \ge r_a$ from below for all policies. Linear program
(27) differs from linear program (20) by not bounding the Bellman residual from below by
0.

Proposition 35. $L_\infty$-API and $L_\infty$-API2 generate the same sequence of policies if the initial
policies and tie-breaking are the same.

Proof. The proposition follows simply by induction from Lemma 39. The base step follows
directly from the assumption. For the inductive step, let $\pi_i^1 = \pi_i^2$, where $\pi^1$ and $\pi^2$ are
the policies with (4) and (27). Then from Lemma 39, we have that the corresponding
value functions satisfy $v_i^1 = v_i^2 + c\mathbf{1}$. Because $\pi_{i+1}^1$ and $\pi_{i+1}^2$ are chosen greedily, we have that
$\pi_{i+1}^1 = \pi_{i+1}^2$. □

The proof of Theorem 34 follows.

Proof. The proof is based on two facts. First, $\tilde v$ is feasible with respect to the constraints in
(ABP-$L_\infty$). The Bellman residual changes for all the policies identically, since a constant
vector is added. Second, because $\bar\pi$ is greedy with respect to $\tilde v$, we have that $\tilde v \ge L_{\bar\pi}\tilde v \ge L\tilde v$.
The value function $\tilde v$ is therefore transitive-feasible.

From Proposition 45, $L_\infty$-API can be replaced by $L_\infty$-API2, which will converge to the
same policy $\bar\pi$. $L_\infty$-API2 will converge to the value function

$$\tilde v = \bar v + \frac{\bar\phi}{1-\gamma}\mathbf{1}.$$

From the constraints in (27) we have that $\tilde v \ge L_{\bar\pi}\tilde v$. Then, since $\bar\pi$ is the greedy policy with
regard to this value function, we have that $\tilde v \ge L_{\bar\pi}\tilde v \ge L\tilde v$. Thus $\tilde v$ is transitive-feasible
and feasible in (BP-m) according to Lemma 38. The theorem then follows from Lemma 39
and from the fact that the greedy policy minimizes the Bellman residual, as in the proof of
Lemma 16. □

To summarize, OAPI guarantees convergence, while matching the performance of $L_\infty$-API.
The convergence of OAPI is achieved because, given a non-negative Bellman residual,
the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the
Bellman residual is always non-negative, it can progressively reduce it. In comparison,
the greedy policy in $L_\infty$-API does not minimize the Bellman residual, and therefore $L_\infty$-API
does not always reduce it. Theorem 34 also explains why API provides better solutions
than ALP, as observed in (Guestrin et al., 2003). From the discussion above, ALP can be
seen as an $L_1$-norm approximation of a single iteration of OAPI. $L_\infty$-API, on the other
hand, performs many such ALP-like iterations.



8. Experimental Results 

In this section, we validate the approach by applying it to simple reinforcement learning 
benchmark problems. The focus of the paper is on the theoretical properties and the 
experiments are intentionally designed to avoid interaction between the approximation in 
the formulation and approximate solution methods. As Theorem 34 shows, even OAPI, the 
very simple approximate algorithm for ABP, can perform as well as existing methods on 
factored MDPs. 

ABP is an off-policy approximation method, like LSPI (Lagoudakis and Parr, 2003) or 
ALP. That means that the samples can be gathered independently of the control policy. It 
is necessary, though, that multiple actions are sampled for each state to enable the selection 
of different policies. 

First, we demonstrate and analyze the properties of ABP on a simple chain problem
with 200 states, in which the transitions move to the right or left (2 actions) by one step
with a centered Gaussian noise of standard deviation 3. The rewards were set to $\sin(i/20)$
for the right action and $\cos(i/20)$ for the left action, where $i$ is the index of the state.
This problem is small enough to calculate the optimal value function and to control the
approximation features. The approximation basis in this problem is represented by piecewise
linear features of the form $\phi(s_i) = [i - c]_+$, for $c$ from 1 to 200. The discount factor
in the experiments was $\gamma = 0.95$ and the initial distribution was $\alpha(130) = 1$. We verified
that the solutions of the bilinear programs were always close to optimal, albeit suboptimal.
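
The rewards, features, and initial distribution of the chain problem can be written down directly from this description. The sketch below is illustrative only: it assumes states are indexed 1 to 200 and omits the transition model, since the text does not specify how the Gaussian noise is discretized or handled at the chain boundaries. All names are hypothetical.

```python
import math

N_STATES = 200
GAMMA = 0.95  # discount factor reported for the chain experiments


def chain_rewards(n_states=N_STATES):
    """Return {(i, action): reward}: sin(i/20) for 'right', cos(i/20) for 'left'."""
    rewards = {}
    for i in range(1, n_states + 1):
        rewards[(i, "right")] = math.sin(i / 20)
        rewards[(i, "left")] = math.cos(i / 20)
    return rewards


def hinge_feature(i, c):
    """Piecewise-linear feature [i - c]_+ = max(i - c, 0)."""
    return max(i - c, 0)


def feature_matrix(centers, n_states=N_STATES):
    """Rows are states 1..n_states; columns are hinge features for the centers."""
    return [[hinge_feature(i, c) for c in centers] for i in range(1, n_states + 1)]


def initial_distribution(start=130, n_states=N_STATES):
    """All initial probability mass on the start state."""
    return [1.0 if i == start else 0.0 for i in range(1, n_states + 1)]
```

A run with randomly chosen features would, under these assumptions, pick a subset of centers from 1 to 200 and pass it to `feature_matrix`.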

Figure 1: $L_\infty$ Bellman residual for the chain problem (methods shown: ABP, ABPexp, ABPh, ALP, API)

We experimented with the full state-action sample and randomly chose the features. All
results are averages over 50 runs with 15 features. In the results, we use ABP to denote a
close-to-optimal solution of robust ABP, ABPexp for the bilinear program (ABP-$L_1$), and
ABPh for (ABP-h) with $k = 5$. API denotes approximate policy iteration that minimizes
the $L_2$ norm.

Figure 1 shows the Bellman residual attained by the methods. It clearly shows that
the robust bilinear formulation most reliably minimizes the Bellman residual. The other
two bilinear formulations are not much worse. Notice also the higher standard deviation of
ALP and API. Figure 2 shows the expected policy loss, as specified in Definition 5, for the
calculated value functions. It confirms that the ABPexp formulation outperforms the robust
formulation, since its explicit objective is to minimize the expected loss. Similarly, Figure 3
shows the robust policy loss. As expected, it confirms the better performance of the robust
ABP formulation in this case.

Note that API and ALP may achieve lower policy loss on this particular domain than 
ABP formulations, even though their Bellman residual is significantly higher. This is pos- 
sible since ABP simply minimizes bounds on the policy loss. The analysis of tightness of 
policy loss bounds is beyond the scope of this paper. 

In the mountain-car benchmark, an underpowered car needs to climb a hill (Sutton 
and Barto, 1998). To do so, it first needs to back up to an opposite hill to gain sufficient 
momentum. The car receives a reward of 1 when it climbs the hill. The discount factor in
the experiments was $\gamma = 0.99$.

The experiments are designed to determine whether OAPI reliably minimizes the Bellman
residual in comparison with API and ALP. We use a uniformly-spaced linear spline
to approximate the value function. The constraints were based on 200 uniformly sampled
states with all 3 actions per state. We evaluated the methods with 100 and 144
approximation features, which corresponds to the number of linear segments.

Figure 2: Expected policy loss for the chain problem (methods shown: ABP, ABPexp, ABPh, ALP, API).

Figure 3: Robust policy loss for the chain problem (methods shown: ABP, ABPexp, ABPh, ALP, API).

The results of robust ABP (in particular OAPI), ALP, API with $L_2$ minimization, and
LSPI are depicted in Table 1. The results are shown for both the $L_\infty$ norm and the
uniformly-weighted $L_2$ norm. The run-times of all these methods are comparable, with ALP being
the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20
iterations, which was an upper bound on the number of iterations of OAPI. The results
demonstrate that ABP minimizes the $L_\infty$ Bellman residual much more consistently than
the other methods. Note, however, that all the considered algorithms would perform
significantly better given a finer approximation.

(a) $L_\infty$ error of the Bellman residual

Features    100            144
OAPI        0.21 (0.23)    0.13 (0.1)
ALP         13. (13.)      3.6 (4.3)
LSPI        9. (14.)       3.9 (7.7)
API         0.46 (0.08)    0.86 (1.18)

(b) $L_2$ error of the Bellman residual

Features    100            144
OAPI        0.2 (0.3)      0.1 (1.9)
ALP         9.5 (18.)      0.3 (0.4)
LSPI        1.2 (1.5)      0.9 (0.1)
API         0.04 (0.01)    0.08 (0.08)

Table 1: Bellman residual of the final value function. The values are averages over 5
executions, with the standard deviations shown in parentheses.

9. Conclusion and Future Work

We proposed and analyzed approximate bilinear programming, a new value-function
approximation method, which provably minimizes bounds on policy loss. ABP returns the
optimal approximate value function with respect to the Bellman residual bounds, despite
being formulated with regard to transitive-feasible value functions. We also showed that
there is no asymptotically simpler formulation, since finding the closest value function and
solving a bilinear program are both NP-complete problems. Finally, the formulation leads
to the development of OAPI, a new convergent form of API which monotonically improves
the objective value function.

While we only discussed approximate solutions of the ABP, a deeper study of bilin- 
ear solvers may render optimal solution methods feasible. ABPs have a small number of 
essential variables (that determine the value function) and a large number of constraints, 
which can be leveraged by the solvers (Petrik and Zilberstein, 2007). The $L_\infty$ error bound
provides good theoretical guarantees, but it may be too conservative in practice. A similar
formulation based on $L_2$ norm minimization may be more practical.

We believe that the proposed formulation will help to deepen the understanding of 
value function approximation and the characteristics of existing solution methods, and 
potentially lead to the development of more robust and widely-applicable reinforcement 
learning algorithms. 

Appendix A. Proofs

A.1 Properties of Transitive-Feasible Value Functions

Basic properties of the Bellman operator, which we often use without reference, are the
following.

Lemma 36. Let $v$ be any value function and let $c$ be a scalar. Then:

$L(v + c\mathbf{1}) = Lv + \gamma c \mathbf{1}$.

Lemma 37 (Monotonicity). Let $P$ be a stochastic matrix. Then both the linear operators
$P$ and $(I - \gamma P)^{-1}$ are monotone:

$x \geq y \Rightarrow Px \geq Py$

$x \geq y \Rightarrow (I - \gamma P)^{-1} x \geq (I - \gamma P)^{-1} y$

for all $x$ and $y$.
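
Both lemmas are easy to check numerically. The following sketch does so on a small, hypothetical 3-state, 2-action MDP (the matrices `P` and rewards `R` are made up for illustration), applying $(I - \gamma P)^{-1}$ through its Neumann series rather than an explicit inverse.

```python
GAMMA = 0.9

# Hypothetical 3-state, 2-action MDP used only for illustration.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 0
    [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5], [0.0, 0.0, 1.0]],   # action 1
]
R = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]

def matvec(M, x):
    """Multiply the matrix M (list of rows) by the vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def bellman(v):
    """Bellman operator L: maximize the one-step backup over actions."""
    return [max(R[a][s] + GAMMA * sum(p * vi for p, vi in zip(P[a][s], v))
                for a in range(2))
            for s in range(3)]

def resolvent(Pm, x, iters=500):
    """Apply (I - gamma P)^{-1} to x by iterating z <- x + gamma P z."""
    z = list(x)
    for _ in range(iters):
        z = [xi + GAMMA * pz for xi, pz in zip(x, matvec(Pm, z))]
    return z

# Lemma 36: shifting v by c*1 shifts Lv by gamma*c*1.
v, c = [1.0, -2.0, 0.5], 3.0
lhs = bellman([vi + c for vi in v])
rhs = [lv + GAMMA * c for lv in bellman(v)]
lemma36_holds = all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))

# Lemma 37: x >= y implies Px >= Py and (I - gamma P)^{-1} x >= (I - gamma P)^{-1} y.
x, y = [1.0, 2.0, 3.0], [0.0, 2.0, 1.0]
lemma37_holds = (
    all(a >= b for a, b in zip(matvec(P[0], x), matvec(P[0], y)))
    and all(a >= b for a, b in zip(resolvent(P[0], x), resolvent(P[0], y)))
)
```

The monotonicity of the resolvent is visible in the iteration itself: every term of the series $\sum_k \gamma^k P^k$ is a non-negative matrix.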

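Lemma 4. Transitive-feasible value functions form an upper bound on the optimal value
function. If $v \in \mathcal{K}(\epsilon)$ is an $\epsilon$-transitive-feasible value function, then

$v \geq v^* - \frac{\epsilon}{1 - \gamma} \mathbf{1}$.

Proof. Let $P^*$ and $r^*$ be the transition matrix and the reward vector of the optimal policy. Then,
using Lemma 37, we have:

$v \geq Lv - \epsilon \mathbf{1}$
$v \geq \gamma P^* v + r^* - \epsilon \mathbf{1}$
$(I - \gamma P^*) v \geq r^* - \epsilon \mathbf{1}$
$v \geq (I - \gamma P^*)^{-1} r^* - \frac{\epsilon}{1 - \gamma} \mathbf{1} = v^* - \frac{\epsilon}{1 - \gamma} \mathbf{1}$.

□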

Lemma 38. A value function $v$ satisfies $Av \geq b$ if and only if $v \geq Lv$. In addition, if $v$ is
feasible in (ABP-$L_\infty$), then $v \geq v^*$.

Proof. The backward implication of the first part of the lemma follows directly from the
definition. The forward implication follows from the existence of $\lambda = 0$, $\lambda' = \| [Av - r]_+ \|_\infty$,
which satisfy the constraints. The constraints on $\pi$ are independent and therefore can be
satisfied independently. The second part of the lemma also holds in ALPs (de Farias, 2002)
and is proven identically. □

The minimization $\min_{v \in \mathcal{M}} \|L_\pi v - v\|_\infty$ for a policy $\pi$ can be represented as the following
linear program.

$\min_{\phi, v} \; \phi$
s.t. $(I - \gamma P_\pi) v + \mathbf{1}\phi \geq r_\pi$    (22)
     $-(I - \gamma P_\pi) v + \mathbf{1}\phi \geq -r_\pi$
     $v \in \mathcal{M}$

Consider also the following linear program.

$\min_{\phi, v} \; \phi$
s.t. $(I - \gamma P_\pi) v \geq r_\pi$    (23)
     $-(I - \gamma P_\pi) v + \mathbf{1}\phi \geq -r_\pi$
     $v \in \mathcal{M}$

Next we show that the optimal solutions of (22) and (23) are closely related. 

Lemma 39. Assume Assumption 10 and a given policy $\pi$. Let $\phi_1, v_1$ and $\phi_2, v_2$ be optimal
solutions of linear programs (22) and (23) respectively. Define:

$\bar{v}_1 = v_1 + \frac{\phi_1}{1 - \gamma} \mathbf{1}$,    $\bar{v}_2 = v_2 - \frac{\phi_2}{2(1 - \gamma)} \mathbf{1}$.

Then:

1. $2\phi_1 = \phi_2$.

2. $\bar{v}_1$ is an optimal solution in (23).

3. $\bar{v}_2$ is an optimal solution in (22).

4. Greedy policies with respect to $v_1$ and $\bar{v}_1$ are identical.

5. Greedy policies with respect to $v_2$ and $\bar{v}_2$ are identical.
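
The relationship in Lemma 39 can be seen in closed form in the simplest representable set, $\mathcal{M} = \{c\mathbf{1}\}$, where $(I - \gamma P_\pi) c\mathbf{1} = (1 - \gamma) c \mathbf{1}$ and both programs reduce to a one-dimensional search over $c$. The sketch below is an illustration only: the reward vector `R_PI` is hypothetical, and a grid search stands in for an LP solver.

```python
GAMMA = 0.5
R_PI = [1.0, 0.0, 2.0]   # hypothetical rewards r_pi of the fixed policy

def phi22(c):
    """Objective of (22) for v = c*1: the L-infinity norm of the residual."""
    return max(abs(rs - (1 - GAMMA) * c) for rs in R_PI)

def phi23(c):
    """Objective of (23) for v = c*1, or infinity when (I - gamma P)v >= r fails."""
    if any((1 - GAMMA) * c < rs for rs in R_PI):
        return float("inf")
    return max((1 - GAMMA) * c - rs for rs in R_PI)

# Brute-force search over c in [0, 10] with step 0.01.
grid = [k / 100 for k in range(0, 1001)]
c1 = min(grid, key=phi22)
phi1 = phi22(c1)
c2 = min(grid, key=phi23)
phi2 = phi23(c2)

# The shifts from Lemma 39 map each optimum onto the other.
c1_shifted = c1 + phi1 / (1 - GAMMA)          # should coincide with c2
c2_shifted = c2 - phi2 / (2 * (1 - GAMMA))    # should coincide with c1
```

In this special case $\phi_1 = (\max_s r_\pi(s) - \min_s r_\pi(s))/2$ and $\phi_2 = \max_s r_\pi(s) - \min_s r_\pi(s)$, so the factor of two in item 1 is explicit.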

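Proof. Let $\bar{\phi}_1 = 2\phi_1$ and $\bar{\phi}_2 = \phi_2 / 2$. We first show that $\bar{\phi}_1, \bar{v}_1$ is feasible in (23). It is representable
since $\mathbf{1} \in \mathcal{M}$ and it is feasible by the following simple algebraic manipulation:

$(I - \gamma P_\pi) \bar{v}_1 = (I - \gamma P_\pi) v_1 + (I - \gamma P_\pi) \frac{\phi_1}{1 - \gamma} \mathbf{1}$
$= (I - \gamma P_\pi) v_1 + \phi_1 \mathbf{1}$
$\geq -\phi_1 \mathbf{1} + r_\pi + \phi_1 \mathbf{1}$
$= r_\pi$

and

$-(I - \gamma P_\pi) \bar{v}_1 + \bar{\phi}_1 \mathbf{1} = -(I - \gamma P_\pi) \bar{v}_1 + 2\phi_1 \mathbf{1}$
$= -(I - \gamma P_\pi) v_1 - (I - \gamma P_\pi) \frac{\phi_1}{1 - \gamma} \mathbf{1} + 2\phi_1 \mathbf{1}$
$= -(I - \gamma P_\pi) v_1 - \phi_1 \mathbf{1} + 2\phi_1 \mathbf{1}$
$= -(I - \gamma P_\pi) v_1 + \phi_1 \mathbf{1}$
$\geq -r_\pi$.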
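
Next we show that $\bar{\phi}_2, \bar{v}_2$ is feasible in (22). This solution is representable, since $\mathbf{1} \in \mathcal{M}$,
and it is feasible by the following simple algebraic manipulation:

$(I - \gamma P_\pi) \bar{v}_2 + \bar{\phi}_2 \mathbf{1} = (I - \gamma P_\pi) v_2 - (I - \gamma P_\pi) \frac{\phi_2}{2(1 - \gamma)} \mathbf{1} + \frac{\phi_2}{2} \mathbf{1}$
$= (I - \gamma P_\pi) v_2 - \frac{\phi_2}{2} \mathbf{1} + \frac{\phi_2}{2} \mathbf{1}$
$= (I - \gamma P_\pi) v_2$
$\geq r_\pi$

and

$-(I - \gamma P_\pi) \bar{v}_2 + \bar{\phi}_2 \mathbf{1} = -(I - \gamma P_\pi) \bar{v}_2 + \frac{\phi_2}{2} \mathbf{1}$
$= -(I - \gamma P_\pi) v_2 + (I - \gamma P_\pi) \frac{\phi_2}{2(1 - \gamma)} \mathbf{1} + \frac{\phi_2}{2} \mathbf{1}$
$= -(I - \gamma P_\pi) v_2 + \phi_2 \mathbf{1}$
$\geq -r_\pi$.

It is now easy to show by contradiction that $\bar{\phi}_1, \bar{v}_1$ is optimal. Assume that there exists
a solution $\phi_2 < \bar{\phi}_1$. But then:

$2\bar{\phi}_2 \leq \phi_2 < \bar{\phi}_1 \leq 2\phi_1$,

which is a contradiction with the optimality of $\phi_1$. The optimality of $\bar{\phi}_2, \bar{v}_2$ can be shown
similarly. □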

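Proposition 40. Assumption 10 implies that:

$\min_{v \in \mathcal{M} \cap \mathcal{K}} \|Lv - v\|_\infty \leq 2 \min_{v \in \mathcal{M}} \|Lv - v\|_\infty$.

Proof. Let $\hat{v}$ be the minimizer of $\hat{\phi} = \min_{v \in \mathcal{M}} \|Lv - v\|_\infty$, and let $\hat{\pi}$ be a policy that is
greedy with respect to $\hat{v}$. Define:

$\bar{v} = \hat{v} + \frac{\hat{\phi}}{1 - \gamma} \mathbf{1}$.

Then from Lemma 39:

1. Value function $\bar{v}$ is an optimal solution of (23): $\bar{v} \geq L_{\hat{\pi}} \bar{v}$.

2. Policy $\hat{\pi}$ is greedy with regard to $\bar{v}$: $L_{\hat{\pi}} \bar{v} \geq L\bar{v}$.

3. $\|L_{\hat{\pi}} \bar{v} - \bar{v}\|_\infty = 2\hat{\phi}$.

Then using a simple algebraic manipulation:

$\bar{v} \geq L_{\hat{\pi}} \bar{v} = L\bar{v}$

and the proposition follows from Lemma 38. □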

Proposition 41. Let $v$ be a solution of the approximate bilinear program (ABP-$L_\infty$) and
let:

$v' = v - \frac{1/2}{1 - \gamma} \|Lv - v\|_\infty \mathbf{1}$.

Then:

1. $\|Lv' - v'\|_\infty = \frac{\|Lv - v\|_\infty}{2}$.

2. Greedy policies with respect to $v$ and $v'$ are identical.

The proposition follows directly from Lemma 39.

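Proposition 42. Assumption 10 implies that:

$\min_{v \in \mathcal{M}} \|Lv - v\|_\infty \leq (1 + \gamma) \min_{v \in \mathcal{M}} \|v - v^*\|_\infty$.

Proof. Assume that $v$ is the minimizer of $\min_{v \in \mathcal{M}} \|v - v^*\|_\infty \leq \epsilon$. Then:

$v^* - \epsilon \mathbf{1} \leq v \leq v^* + \epsilon \mathbf{1}$
$Lv^* - \gamma\epsilon \mathbf{1} \leq Lv \leq Lv^* + \gamma\epsilon \mathbf{1}$
$Lv^* - \gamma\epsilon \mathbf{1} - v \leq Lv - v \leq Lv^* + \gamma\epsilon \mathbf{1} - v$
$Lv^* - v^* - (1 + \gamma)\epsilon \mathbf{1} \leq Lv - v \leq Lv^* - v^* + (1 + \gamma)\epsilon \mathbf{1}$
$-(1 + \gamma)\epsilon \mathbf{1} \leq Lv - v \leq (1 + \gamma)\epsilon \mathbf{1}$.

□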
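
The chain of inequalities in the proof holds for any value function, not only for the minimizer, which makes it simple to test. The sketch below perturbs the optimal value function of a small hypothetical MDP (the matrices `P` and rewards `R` are illustrative assumptions) and checks that the Bellman residual never exceeds $(1 + \gamma)$ times the distance to $v^*$.

```python
import random

GAMMA = 0.9

# Hypothetical 3-state, 2-action MDP used only for illustration.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 0
    [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5], [0.0, 0.0, 1.0]],   # action 1
]
R = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]

def bellman(v):
    """Bellman operator L: maximize the one-step backup over actions."""
    return [max(R[a][s] + GAMMA * sum(p * vi for p, vi in zip(P[a][s], v))
                for a in range(2))
            for s in range(3)]

def sup_norm(x):
    """L-infinity norm of a vector."""
    return max(abs(t) for t in x)

# Optimal value function by value iteration (2000 sweeps is far more than needed).
v_star = [0.0, 0.0, 0.0]
for _ in range(2000):
    v_star = bellman(v_star)

rng = random.Random(0)
checks = []
for _ in range(100):
    v = [vs + rng.uniform(-5, 5) for vs in v_star]
    residual = sup_norm([a - b for a, b in zip(bellman(v), v)])
    distance = sup_norm([a - b for a, b in zip(v, v_star)])
    checks.append(residual <= (1 + GAMMA) * distance + 1e-9)
```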



Proposition 24. Let $v_1$ and $v_2$ be feasible value functions in (ABP-$L_\infty$). Then the
value function

$v(s) = \min\{v_1(s), v_2(s)\}$

is also feasible in bilinear program (ABP-$L_\infty$). Therefore $v \geq v^*$ and

$\|v^* - v\|_\infty \leq \min\{\|v^* - v_1\|_\infty, \|v^* - v_2\|_\infty\}$.

Proof. Consider a state $s$ and action $a$. Then from transitive feasibility of the value functions
$v_1$ and $v_2$ we have:

$v_1(s) \geq \gamma \sum_{s' \in S} P(s', s, a) v_1(s') + r(s, a)$
$v_2(s) \geq \gamma \sum_{s' \in S} P(s', s, a) v_2(s') + r(s, a)$.

From the concavity of the min operator we have that:

$\min\left\{ \sum_{s' \in S} P(s', s, a) v_1(s'), \sum_{s' \in S} P(s', s, a) v_2(s') \right\} \geq \sum_{s' \in S} P(s', s, a) \min\{v_1(s'), v_2(s')\}$.

Then the proposition follows by the following simple algebraic manipulation:

$v(s) = \min\{v_1(s), v_2(s)\} \geq \gamma \min\left\{ \sum_{s' \in S} P(s', s, a) v_1(s'), \sum_{s' \in S} P(s', s, a) v_2(s') \right\} + r(s, a)$
$\geq \gamma \sum_{s' \in S} P(s', s, a) \min\{v_1(s'), v_2(s')\} + r(s, a)$
$= \gamma \sum_{s' \in S} P(s', s, a) v(s') + r(s, a)$.

□
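
A small numerical illustration of the proposition follows. The MDP and the two offsets added to $v^*$ are hypothetical; the offsets are chosen so that each perturbed value function is transitive-feasible while neither dominates the other, which makes the pointwise minimum non-trivial.

```python
GAMMA = 0.9

# Hypothetical 3-state, 2-action MDP used only for illustration.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action 0
    [[1.0, 0.0, 0.0], [0.2, 0.3, 0.5], [0.0, 0.0, 1.0]],   # action 1
]
R = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]]

def backup(v, s, a):
    """One-step backup r(s, a) + gamma * sum_s' P(s', s, a) v(s')."""
    return R[a][s] + GAMMA * sum(p * x for p, x in zip(P[a][s], v))

def is_transitive_feasible(v, tol=1e-9):
    """Check v >= Lv, i.e. v(s) dominates every action's backup."""
    return all(v[s] >= backup(v, s, a) - tol for s in range(3) for a in range(2))

def sup_dist(x, y):
    """L-infinity distance between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

# Optimal value function by value iteration.
v_star = [0.0, 0.0, 0.0]
for _ in range(2000):
    v_star = [max(backup(v_star, s, a) for a in range(2)) for s in range(3)]

# Offsets w with w(s) >= gamma * max(w) keep v* + w transitive-feasible.
v1 = [vs + w for vs, w in zip(v_star, [1.00, 0.95, 0.92])]
v2 = [vs + w for vs, w in zip(v_star, [0.91, 1.00, 0.97])]
v_min = [min(a, b) for a, b in zip(v1, v2)]

min_is_feasible = is_transitive_feasible(v_min)
error_not_worse = sup_dist(v_star, v_min) <= min(sup_dist(v_star, v1),
                                                 sup_dist(v_star, v2)) + 1e-12
```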
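
Lemma 43. Let $u_\pi$ be the state-action visitation frequency of policy $\pi$. Then:

$\mathbf{1}^T u_\pi = \frac{1}{1 - \gamma}$.

Proof. Let $u_a(s) = u_\pi(s, \pi(s, a))$ for all states $s \in S$ and actions $a \in A$. The lemma follows
as:

$\sum_{a \in A} u_a^T (I - \gamma P_a) = \alpha^T$
$\sum_{a \in A} u_a^T (I - \gamma P_a) \mathbf{1} = \alpha^T \mathbf{1}$
$(1 - \gamma) \sum_{a \in A} u_a^T \mathbf{1} = 1$
$\mathbf{1}^T u_\pi = \frac{1}{1 - \gamma}$.

□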
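
The identity can be checked directly for a deterministic policy, in which case the state-action frequencies reduce to state frequencies solving $u^T (I - \gamma P_\pi) = \alpha^T$. The transition matrix `P_PI` and the initial distribution `ALPHA` below are hypothetical, and `visitation` is an illustrative helper that solves the system by fixed-point iteration.

```python
GAMMA = 0.9

# Hypothetical transition matrix of a fixed deterministic policy (rows sum to 1).
P_PI = [[0.5, 0.5, 0.0],
        [0.2, 0.3, 0.5],
        [0.0, 0.0, 1.0]]
ALPHA = [0.2, 0.3, 0.5]   # hypothetical initial distribution

def visitation(alpha, P_pi, gamma, iters=2000):
    """Solve u^T (I - gamma P_pi) = alpha^T by iterating u <- alpha + gamma P_pi^T u."""
    n = len(alpha)
    u = list(alpha)
    for _ in range(iters):
        u = [alpha[s2] + gamma * sum(u[s] * P_pi[s][s2] for s in range(n))
             for s2 in range(n)]
    return u

u = visitation(ALPHA, P_PI, GAMMA)
total = sum(u)   # Lemma 43 predicts 1 / (1 - gamma)
```

Summing the fixed-point equation over states gives $\mathbf{1}^T u = 1 + \gamma \mathbf{1}^T u$, which is the same argument as in the proof.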

A.2 NP-Completeness

Proposition 44 (e.g. (Mangasarian, 1995)). A bilinear program can be solved in NP time. 

There is an optimal solution of the bilinear program such that the solutions of the 
individual linear programs are basic feasible. The set of all basic feasible solutions is finite, 
because the feasible regions of w, x and y, z are independent. The value of a basic feasible 
solution can be calculated in polynomial time. 

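Theorem 32. Assume $0 < \gamma < 1$, and a given $\epsilon > 0$. Then it is NP-complete to
determine:

$\min_{v \in \mathcal{K} \cap \mathcal{M}} \|Lv - v\|_\infty < \epsilon$,    $\min_{v \in \mathcal{M}} \|Lv - v\|_\infty < \epsilon$.

The problem remains NP-complete when Assumption 10 is satisfied. It is also NP-complete
to determine:

$\min_{v \in \mathcal{M}} \|Lv - v\|_\infty - \|v^* - v\|_{1,\alpha} < \epsilon$,    $\min_{v \in \mathcal{M}} \|Lv - v\|_{1,\bar{u}} - \|v^* - v\|_{1,\alpha} < \epsilon$,

assuming that $\bar{u}$ is defined as in Remark 18.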

Proof. The membership in NP follows from Theorem 13 and Proposition 44. We show
NP-hardness by a reduction from the 3SAT problem. We first do not assume Assumption 10.
We show the theorem for $\epsilon = 1$. The appropriate $\epsilon$ can be obtained by simply scaling the
rewards in the MDP.

The main idea is to construct an MDP and an approximation basis, such that the
approximation error is small whenever the SAT is satisfiable. The value of the states will
correspond to the truth value of the literals and clauses. The approximation features $\phi$
will be used to constrain the values of literals that share the same variable. The MDP
constructed from the SAT formula is depicted in Figure 4.

Figure 4: MDP constructed from the corresponding SAT formula.

Consider a SAT problem with clauses $C_i$:

$\bigwedge_{i=1,\ldots,n} C_i = \bigwedge_{i=1,\ldots,n} (l_{i1} \vee l_{i2} \vee l_{i3})$,

where $l_{ij}$ are literals. A literal is a variable or a negation of a variable. The variables in
the SAT are $x_1, \ldots, x_m$. The corresponding MDP is constructed as follows. It has one state
$s(l_{ij})$ for every literal $l_{ij}$, one state $s(C_i)$ for each clause $C_i$ and an additional state $\bar{s}$. That
is:

$S = \{s(C_i) \mid i = 1, \ldots, n\} \cup \{s(l_{ij}) \mid i = 1, \ldots, n,\ j = 1, \ldots, 3\} \cup \{\bar{s}\}$.

There are 3 actions available for each state $s(C_i)$, which determine the true literal of the
clause. There is only a single action available in states $s(l_{ij})$ and $\bar{s}$. All transitions in the
MDP are deterministic. The transition $t(s, a) = (s', r)$ is from the state $s$ to $s'$, when action
$a$ is taken, and the reward received is $r$. The transitions are the following:

$t(s(C_i), a_j) = (s(l_{ij}), 1 - \gamma)$    (24)
$t(s(l_{ij}), a) = (s(l_{ij}), -(1 - \gamma))$    (25)
$t(\bar{s}, a) = (\bar{s}, 2 - \gamma)$    (26)

Notice that the rewards depend on the discount factor $\gamma$, for notational convenience.
There is one approximation feature for every variable $x_k$ such that:

$\phi_k(s(C_i)) = 0$
$\phi_k(\bar{s}) = 0$
$\phi_k(s(l_{ij})) = 1$ if $l_{ij} = x_k$, and $-1$ if $l_{ij} = \neg x_k$.

An additional feature in the problem $\bar{\phi}$ is defined as:

$\bar{\phi}(s(C_i)) = 1$
$\bar{\phi}(s(l_{ij})) = 0$
$\bar{\phi}(\bar{s}) = 1$.

The purpose of state $\bar{s}$ is to ensure that $v(s(C_i)) \geq 2 - \gamma$, as we assume in the remainder of
the proof.

First, we show that if the SAT problem is satisfiable, then $\min_{v \in \mathcal{M} \cap \mathcal{K}} \|Lv - v\|_\infty < 1$.
The value function $v \in \mathcal{K}$ is constructed as a linear sum of the features as: $v = \Phi y$, where
$y = (y_1, \ldots, y_m, \bar{y})$. Here $y_k$ corresponds to $\phi_k$ and $\bar{y}$ corresponds to $\bar{\phi}$. The coefficients $y_k$
are constructed from the truth value of the variables as follows:

$y_k = \gamma$ if $x_k$ = true, and $y_k = -\gamma$ if $x_k$ = false
$\bar{y} = 2 - \gamma$.

Now define the deterministic policy $\pi$ as:

$\pi(s(C_i)) = a_j$ where $l_{ij}$ = true.

The true literals are guaranteed to exist from the satisfiability. This policy is greedy with
respect to $v$ and satisfies that $\|L_\pi v - v\|_\infty \leq 1 - \gamma^2$.

The Bellman residuals for all actions and states, given a value function $v$, are defined
as:

$v(s) - \gamma v(s') - r$,

where $t(s, a) = (s', r)$. Given the value function $v$, the residual values are:

$t(s(C_i), a_j) = (s(l_{ij}), 1 - \gamma)$:
    $2 - \gamma - \gamma^2 - (1 - \gamma) = 1 - \gamma^2$ if $l_{ij}$ = true
    $2 - \gamma + \gamma^2 - (1 - \gamma) = 1 + \gamma^2$ if $l_{ij}$ = false

$t(s(l_{ij}), a) = (s(l_{ij}), -(1 - \gamma))$:
    $\gamma - \gamma^2 + 1 - \gamma = 1 - \gamma^2$ if $l_{ij}$ = true
    $-\gamma + \gamma^2 + 1 - \gamma = (1 - \gamma)^2 > 0$ if $l_{ij}$ = false

$t(\bar{s}, a) = (\bar{s}, 1 - \gamma)$:
    $(1 - \gamma) + \gamma - 1 = 0$

It is now clear that $\pi$ is greedy and that:

$\|L_\pi v - v\|_\infty = 1 - \gamma^2 < 1$.
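
The residual values above can be reproduced for a concrete formula. The following sketch is an illustration under stated assumptions: literals are encoded as signed integers (`k` for $x_k$, `-k` for $\neg x_k$), the two-clause formula and the assignments are hypothetical, and the auxiliary state $\bar{s}$ is left out, so only clause and literal states are examined. The function `residual_norm` is a hypothetical helper.

```python
GAMMA = 0.9

def residual_norm(clauses, assignment, gamma=GAMMA):
    """Return (largest, smallest) Bellman residual v - Lv over clause and literal states."""
    def v_literal(lit):
        y_k = gamma if assignment[abs(lit)] else -gamma   # coefficient of feature phi_k
        return y_k if lit > 0 else -y_k                   # phi_k is +1 for x_k, -1 for its negation

    v_clause = 2 - gamma                                  # coefficient of the additional feature
    residuals = []
    for clause in clauses:
        # At a clause state the greedy action attains the smallest residual.
        residuals.append(min(v_clause - gamma * v_literal(l) - (1 - gamma)
                             for l in clause))
        for l in clause:
            # Literal states loop onto themselves with reward -(1 - gamma).
            residuals.append(v_literal(l) - gamma * v_literal(l) + (1 - gamma))
    return max(abs(b) for b in residuals), min(residuals)

# Hypothetical formula: (x1 or not x2 or x3) and (not x1 or x2 or x3).
formula = [(1, -2, 3), (-1, 2, 3)]

# A satisfying assignment yields the residual 1 - gamma^2.
sat_norm, sat_min = residual_norm(formula, {1: True, 2: True, 3: False})

# An assignment that falsifies the first clause pushes the residual above 1.
unsat_norm, _ = residual_norm(formula, {1: False, 2: True, 3: False})
```

With a satisfying assignment every residual is non-negative, so the constructed value function is transitive-feasible on these states, and the largest residual equals $1 - \gamma^2$. When a clause has no true literal, its residual becomes $1 + \gamma^2$.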

Wc now show that if the SAT problem is not satisfiable then min^gx:nA4 H^'W — ^'Hoo ^ 
1 — Now, given a value function v, there are two possible cases for each v{s{lij)): 1) a 
positive value, 2) a non-positive value. Two literals that share the same variable will have 
the same sign, since there is only one feature per each variable. 

Assume now that there is a value function $v$. There are two possible cases we analyze:
1) all transitions of a greedy policy are to states with positive value, and 2) there is at least
one transition to a state with a non-positive value. In the first case, we have that

$\forall i \; \exists j : v(s(l_{ij})) > 0$.

That is, there is a function $q(i)$, which returns the positive literal for the clause $i$. Now,
create a satisfying assignment of the SAT as follows:

$x_k$ = true if $l_{iq(i)} = x_k$, and $x_k$ = false if $l_{iq(i)} = \neg x_k$,
with other variables assigned arbitrary values. Given this assignment, all literals with
states that have a positive value will be also positive. Since every clause contains at least
one positive literal, the SAT is satisfiable, which is a contradiction with the assumption.
Therefore, there is at least one transition to a state with a non-positive value.

Let $C_1$ represent the clause with a transition to a literal $l_{11}$ with a non-positive value,
without loss of generality. The Bellman residuals at the transitions from these states will
be:

$b_1 = v(s(l_{11})) - \gamma v(s(l_{11})) + (1 - \gamma) \geq 1 - \gamma$
$b_2 = v(s(C_1)) - \gamma v(s(l_{11})) - (1 - \gamma) \geq 2 - \gamma - 0 - 1 + \gamma = 1$

Therefore, the Bellman residual of $v$ is bounded as:

$\|Lv - v\|_\infty \geq \max\{b_1, b_2\} \geq 1$.

Since we did not make any assumptions on $v$, the claim holds for all representable and
transitive-feasible value functions. Therefore, $\min_{v \in \mathcal{M} \cap \mathcal{K}} \|Lv - v\|_\infty \leq 1 - \gamma^2$ if and only if
the 3-SAT problem is satisfiable.

It remains to show that the problem remains NP-complete even when Assumption 10 
holds. Define a new state si with the following transition: 

t{s2,a) = (S2,-|)- 

All previously introduced features $\phi$ are zero on the new state. That is, $\phi_k(s_1) = \bar{\phi}(s_1) = 0$.

The new constant feature takes the value 1 for all states $s \in S$, and the matching coefficient is
denoted as $y_1$. When the formula is satisfiable, then clearly $\min_{v \in \mathcal{M} \cap \mathcal{K}} \|Lv - v\|_\infty \leq 1 - \gamma^2$
since the basis is now richer and the Bellman error on the new transition is less than $1 - \gamma^2$
when $y_1 = 0$.

Now we show that when the formula is not satisfiable, then: 

min \\Lv — v\\r^ > 1 . 

veMnK " " - 2 

This can be scaled to an appropriate e by scaling the rewards. Notice that 

< < |. 

When yi < 0, the Bellman residual on transitions s{Ci) — >■ s{lij) may be decreased by 
increasing yi while adjusting other coefficients to ensure that v{s{Ci)) = 2 — 7. When 

2 

yi > ^ then the Bellman residual from the state si is greater than 1 — Given the 
bounds on yi, the argument for yk = holds and the minimal Bellman residual is achieved 
when: 

v{siQ)) - Msikj)) - (1 - 7) = v{s{si)) - M^s^)) + ^ 
- - - 7 

2 - 7 - 7yi - (1 - 7) = yi - 72/1 + 2 

7 

yi= 2- 

Therefore, when the SAT is unsatisfiable, the Bellman residual is at least 1 -

The NP-completeness of $\min_{v \in \mathcal{M}} \|Lv - v\|_\infty < \epsilon$ follows trivially from Proposition 40.
The proof for Hi; — Lt;||oo — oJv is almost identical. The difference is a new state s, such 
that (j){s) = 1 and q(s) = 1. In that case a^v = 1 for all v e M. The additional term thus 
has no effect on the optimization. 

The proof can be similarly extended to the minimization of $\|v - Lv\|_{1,\bar{u}}$. Define $\bar{u}(C_i) = 1/n$
and $\bar{u}(l_{ij}) = 0$. Then the SAT problem is satisfiable if and only if $\|v - Lv\|_{1,\bar{u}} = 1 - \gamma^2$.
Note that $\bar{u}$, as defined above, is not an upper bound on the visitation frequencies $u_\pi$. It is
likely that the proof could be extended to cover the case $\bar{u} \geq u_\pi$ by more carefully designing
the transitions from $C_i$. In particular, there needs to be high probability of returning to $C_i$
and $\bar{u}(l_{ij}) > 0$. □

A.3 Equivalence of OAPI and API 

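We first consider $L_\infty$-API2 as a modification of $L_\infty$-API. $L_\infty$-API2 is shown in Algorithm 1,
where $f(\pi)$ is calculated using the following program:

$\min_{\phi, v} \; \phi$
s.t. $(I - \gamma P_\pi) v + \mathbf{1}\phi \geq r_\pi$    (27)
     $-(I - \gamma P_\pi) v + \mathbf{1}\phi \geq -r_\pi$
     $v \in \mathcal{M}$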



Proposition 45. $L_\infty$-API and $L_\infty$-API2 generate the same sequence of policies if the initial
policies and tie-breaking are the same.

Proof. The proposition follows simply by induction from Lemma 39. The basic step follows
directly from the assumption. For the inductive step, let $\pi_i^1 = \pi_i^2$, where $\pi^1$ and $\pi^2$ are
the policies with (4) and (27). Then from Lemma 39, we have that the corresponding
value functions satisfy $v_i^1 = v_i^2 + c\mathbf{1}$. Because $\pi_{i+1}^1$ and $\pi_{i+1}^2$ are chosen greedily, we have that
$\pi_{i+1}^1 = \pi_{i+1}^2$. □

We are now ready to prove the theorem.

Theorem 34. Assume that $L_\infty$-API converges to a policy $\pi$ and a value function $v$ that
both satisfy: $\phi = \|v - L_\pi v\|_\infty = \|v - Lv\|_\infty$. Then

$\bar{v} = v + \frac{\phi}{1 - \gamma} \mathbf{1}$

is feasible in the bilinear program (ABP-$L_\infty$), and it is a fixed point of OAPI. In addition,
the greedy policies with respect to $\bar{v}$ and $v$ are identical.

Proof. From Proposition 45, $L_\infty$-API can be replaced by $L_\infty$-API2, which will converge to
the same policy $\pi$. $L_\infty$-API2 will converge to the value function

$\bar{v} = v + \frac{\phi}{1 - \gamma} \mathbf{1}$.

From the constraints in (27) we have that:

$\bar{v} \geq L_\pi \bar{v}$.

Then, since $\pi$ is the greedy policy with regard to this value function, we have that:

$\bar{v} \geq L_\pi \bar{v} \geq L\bar{v}$.

Thus $\bar{v}$ is transitive-feasible and feasible in (BP-m) according to Lemma 38. The theorem
then follows from Lemma 39 and from the fact that the greedy policy minimizes the Bellman
residual, as in the proof of Lemma 16. □

Notice that while the optimistic and standard policy iterations can converge to the 
same solutions, the steps in their computation may not be identical. The actual results will 
depend on the initialization. 